The proliferation of GenAI tools continues to compel us to critically reassess how we gauge success in the modern digital age. Like other transformative technologies before it, the rise of AI necessitates a shift in our focus. The future vitality of the internet and the broader tech ecosystem will no longer be defined solely by the metrics of success outlined in the 90s or early 00s. Instead, the emphasis is increasingly on the caliber of data, the reliability of information, and the vital role of expert communities and individuals in meticulously creating, sharing, and curating knowledge.
In light of that new world, we’re kicking off this new blog series focused on the challenges of evaluating the quality of internal and external datasets.
Data acquisition, the process of gathering information for analysis, forms the foundation for informed decision-making across numerous fields. However, the sheer volume of data available today can be overwhelming. This post explores crucial lessons learned in the trenches of data licensing, drawing on Stack Overflow's experience and on the growing importance of socially responsible data practices in a changing internet landscape.
Garbage in, garbage out
The old adage "garbage in, garbage out" is more relevant than ever when it comes to data acquisition. Collecting vast amounts of data is futile, even detrimental, if that data is irrelevant, inaccurate, or poorly structured. Storing, transferring, and processing data costs money, so if you start with a mountain of bad data, you’ll pay more to get it close to good—if that’s even possible.
As discussed in numerous posts here from our team at Stack Overflow, the focus should always be on identifying and acquiring the right data. This is particularly important in the age of AI, where the quality of the training data directly impacts the performance of AI models and opens new research opportunities. As our CEO Prashanth Chandrasekar noted while speaking at HumanX, “When people put their neck on the line by using these AI tools, they want to make sure they can rely on it. By providing attribution in links and citations, you're grounding these AI answers in real truth.”
Furthermore, the principles of socially responsible AI emphasize the need for datasets that are free from bias (or that make any bias known), promote accuracy, and link back to and attribute the high-quality, well-curated datasets and experts behind them.
Understanding what makes for quality data can save you time and money. Satish Jayanthi, CTO and co-founder of Coalesce, told us, “There are a lot of aspects to data quality. There is accuracy and completeness. Is it relevant? Is it standardized?” Depending on your use case, there may be more or different aspects to data quality for you to consider.
Key considerations before you start down this path:
- Define your objectives: Before collecting any data, clearly define the questions you need to answer or the problems you aim to solve. This will guide your data selection process.
- Prioritize quality over quantity: A smaller, high-quality dataset is far more valuable than a massive collection of unreliable information. Invest time in understanding your data source and its limitations.
- Understand data types and structures: Different data types (e.g., numerical, categorical, textual) require different processing techniques. Knowing the structure of your data upfront will streamline analysis.
- Implement data validation: Establish mechanisms to check the accuracy, completeness, and consistency of your data as it's being acquired. This can involve range checks, format validation, and cross-referencing with other sources (see the sketch after this list).
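To make that last point concrete, here's a minimal sketch of what acquisition-time validation might look like. The record shape, field names, and thresholds here are hypothetical, invented purely for illustration; the point is that completeness checks, format checks, and range checks are cheap to run as data arrives and expensive to skip.

```python
import re
from datetime import datetime

# Hypothetical record shape for illustration: an email, a signup date,
# and a satisfaction score on a 1-5 scale.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record: dict) -> list[str]:
    """Return a list of validation problems; an empty list means the record passes."""
    problems = []

    # Completeness: every expected field must be present and non-empty.
    for field in ("email", "signup_date", "score"):
        if record.get(field) in (None, ""):
            problems.append(f"missing field: {field}")

    # Format validation: email shape and ISO-8601 date.
    if not EMAIL_RE.match(str(record.get("email", ""))):
        problems.append("email is not a valid address")
    try:
        datetime.strptime(str(record.get("signup_date", "")), "%Y-%m-%d")
    except ValueError:
        problems.append("signup_date is not YYYY-MM-DD")

    # Range check: scores outside 1-5 are almost certainly bad data.
    score = record.get("score")
    if not isinstance(score, (int, float)) or not 1 <= score <= 5:
        problems.append("score outside expected 1-5 range")

    return problems

if __name__ == "__main__":
    good = {"email": "dev@example.com", "signup_date": "2024-03-15", "score": 4}
    bad = {"email": "oops", "score": 99}
    print(validate_record(good))  # []
    print(validate_record(bad))   # several problems
```

Running checks like these at the point of acquisition, rather than after the data has already landed in your warehouse, keeps bad records from quietly polluting everything downstream.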
The more effort you spend evaluating your data up front, the better. If you find yourself spending most of your time questioning the accuracy of data you've already acquired, it's a sign you didn't think critically about what you actually needed.
This is why the Stack Overflow platform is so powerful. Our strict moderation policies and rich user feedback signals provide a reliable source of truth, high-quality knowledge, and verified technical (and non-technical) expertise expressed in natural language, ideal for LLM training. When we used our public dataset to fine-tune two LLMs, we saw a 17% increase in technical accuracy in Q&A. Our own tests show that fine-tuning on Stack Overflow data results in substantial LLM performance improvements.
The power of third-party data
While internal data provides valuable insights into your own operations, leveraging third-party data can significantly broaden your understanding of the external landscape. In an evolving industry with shifting business models, the insights gained from diverse third-party sources become even more critical. One of the most valuable sources of this kind of data is an active, passionate, and trustworthy community like Stack Overflow. As we outlined in an earlier blog post, the survival of user communities depends on creators who continually produce new and relevant content that serves as domain-specific, high-quality, validated, and trustworthy data. It also leans heavily on ethical, responsible use of that data for community good and reinvestment in the communities that develop and curate these knowledge bases.

These community-based businesses will succeed if they can deploy and structure their content for consumption at scale, identify and support end-product use cases for their content by third parties, and deliver ROI for enterprises. They must also establish a network of enterprise relationships and procure and deploy relevant datasets to build and optimize for end users (in Stack’s case, developers). In the long term, they will sustain these businesses through their ability to create new data sources, protect existing ones from unpermitted commercial use and abuse, and maintain macroeconomic conditions conducive to selling access to data or building tools based on knowledge, content, and data.
Advantages of using third-party data:
- Knowledge gaps: Provide content that fills a specific knowledge gap you may have in your product or organization.
- Competitive intelligence: Gain insights into your competitors' strategies, pricing, and market share.
- Market trends: Identify emerging trends, shifts in consumer behavior, and macroeconomic factors impacting your industry.
- Enriched customer profiles: Supplement your internal customer data with demographic, psychographic, and behavioral information from external sources for a more holistic view.
- Risk assessment: Access data on creditworthiness, fraud indicators, and regulatory compliance to mitigate potential risks.
- Geospatial insights: Incorporate location-based data for market analysis, logistics optimization, and targeted marketing.
However, integrating third-party data comes with its own set of challenges, including data quality inconsistencies, integration complexities, and compliance needs. When considering third-party AI APIs or data sources, it's crucial to evaluate their commitment to socially responsible AI principles, ensuring alignment with ethical considerations and fairness.
How to best leverage third-party data
Effectively leveraging third-party data requires a strategic approach and careful execution. Here are some best practices for using third-party data to support your business goals:
- Clearly define use cases: Align your third-party data needs with the objectives you defined in the first section. Identify specific business problems or opportunities that third-party data can address. Avoid acquiring data without a clear purpose.
- Evaluate data sources rigorously: Assess the reliability, accuracy, and relevance of potential data providers. Look for transparent methodologies and strong data governance practices. Inquire about their data sourcing and bias mitigation strategies to align with socially responsible AI practices.
- Plan data integration: Your third-party data will need to play nice with your existing systems and internal datasets. Consider API formats and costs, data warehouse scaling, and new and existing ETL (extract, transform, load) processes. Pay close attention to data formats, schemas, and units of measurement (see the sketch after this list).
- Address data privacy and compliance: Users expect that the data they give you won’t leak out of your systems, and data privacy regulations (e.g., GDPR, CCPA) will punish you if you’re careless. Ensure that your use of third-party data complies with all applicable laws and ethical guidelines. Secure necessary permissions and anonymize data when required.
- Start small and iterate: The Agile principle of failing fast applies to data projects, too. Begin with pilot projects to test the value and feasibility of integrating specific third-party datasets before committing to large-scale implementations.
- Continuously monitor and evaluate: Regularly assess the performance and ROI of your third-party data integrations. Data sources and their quality can change over time.
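As a small illustration of the integration point above, here's a sketch of the kind of transform step that keeps third-party data consistent with your internal schemas and units. The vendor field names, internal schema, and unit conversions are all invented for the example; the takeaway is that every field mapping and unit conversion should be explicit in code rather than assumed.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical internal schema; real field names and unit conventions
# depend entirely on your own systems.
@dataclass
class Product:
    sku: str
    name: str
    price_usd: float   # internal convention: prices stored in US dollars
    weight_kg: float   # internal convention: weights stored in kilograms

CENTS_PER_DOLLAR = 100
GRAMS_PER_KG = 1000

def from_vendor(raw: dict[str, Any]) -> Product:
    """Map one (hypothetical) vendor record onto the internal schema, converting units explicitly."""
    return Product(
        sku=str(raw["item_id"]).strip().upper(),
        name=str(raw["title"]).strip(),
        price_usd=int(raw["price_cents"]) / CENTS_PER_DOLLAR,  # vendor sends cents
        weight_kg=float(raw["weight_g"]) / GRAMS_PER_KG,       # vendor sends grams
    )

if __name__ == "__main__":
    vendor_row = {"item_id": "ab-123", "title": " Widget ", "price_cents": 1999, "weight_g": 450}
    print(from_vendor(vendor_row))
    # Product(sku='AB-123', name='Widget', price_usd=19.99, weight_kg=0.45)
```

Keeping the mapping in one small, testable function also makes it easy to re-validate when the provider changes their schema, which ties directly into the monitoring point above.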
Stack Overflow's own experience in developing tools like Question Assistant highlights the importance of data quality and careful data handling. Question Assistant, which uses AI to help users clarify or improve their questions before posting, demonstrates how AI can ensure that the data entering a system is high quality by helping people ask the right question to get the answer they need.
Understanding data acquisition involves a shift from simply collecting data to strategically acquiring the right data, both internal and external. By prioritizing data quality, carefully evaluating third-party sources, and implementing robust integration strategies, organizations can transform raw information into actionable insights—a sentiment we’ve echoed time and again here at Stack Overflow.
This exploration into the fundamentals of data acquisition is just the beginning. In future posts across this series, our data science and Knowledge Solutions teams will dive deeper into building robust data strategies. We'll tackle crucial topics like the importance of data diversity, the essential dos and don'ts of data analysis, and data security best practices for strong, accurate, and protected datasets. We'll explore the practicalities of using data sets effectively (and ineffectively), delve into the strategic advantages of third-party data, and examine how platforms like Stack Overflow bolster non-coding tools. We'll do our best to demystify APIs and data models, address real-world market needs beyond the tech giants, and compare internal, third-party, and synthetic data, including their ideal use cases and how they can be combined for stronger models and outputs.
We are all building this next phase of the internet together. If you have thoughts, questions, or best practices to add to the conversation, please reach out.