The Case for Data-Driven Open Source Development

Eiso Kant

Eiso Kant is the CEO and co-founder of source{d}, where he considers himself very privileged to work alongside an incredible team. Together they are focused on bringing Code as Data and Machine Learning on Code to developers and engineering leaders.

Every year the number of Open Source companies and developer communities continues to grow. Open Source is becoming the de facto standard for software development as companies realize its cost, agility and innovation benefits. In addition to embracing Linux, Microsoft recently open sourced its entire patent portfolio to all members of the Open Invention Network. Companies are not only hiring engineers based on their Open Source Software (OSS) knowledge but also allocating 100 percent of those engineers' time to external projects. As a result, the quality and feature sets of these projects improve significantly, which further accelerates their adoption in the enterprise. Very successful Open Source projects such as Kubernetes have helped define best practices for contributions (both technical and non-technical), communication (both online and offline), openness (Summits, Special Interest Groups, etc.) and governance (maintainership, Technical Advisory Board, etc.). There is no need to reinvent the wheel: well-established frameworks already exist for companies to work with.

There is, however, one major problem that needs to be addressed: the lack of standardized metrics, datasets, methodologies and tools for extracting insights from Open Source projects.

Open Source Metrics That Actually Matter

Let’s take a look at the first part of the problem: the metrics. OSS project stakeholders simply don’t have the data to make informed decisions, identify trends and forecast issues before they arise. Providing standardized and language-agnostic data to all stakeholders is essential for the success of the Open Source industry as a whole. Anyone can query the APIs of large code hosting platforms such as GitHub and GitLab and find interesting metrics, but this approach has limitations. The data points you want are not always available, complete or properly structured. There are also public datasets such as GH Archive, but they are not optimized for exhaustive querying of commits, issues, PRs, reviews and comments across a large set of distributed Git repositories.

Retrieving source code from a mono-repository is a relatively easy task, but code retrieval at scale is a pain point for researchers, Open Source maintainers and managers who want to track individual or team contributions. The Open Source community is starting to recognize this challenge and the limitations of existing datasets. For instance, Public Git Archive now provides over 4 TB of source code, covering hundreds of programming languages, in a format better suited to analysis by academic researchers. That is definitely a step in the right direction.

The broader Open Source community seems to lack both the quantitative and the qualitative data needed to measure Open Source success effectively. Beyond very basic metrics such as the number of stars, commits or contributors, maintainers should be able to know the average time to merge or close a pull request (PR), or the average time spent reviewing closed PRs. Similarly, maintainers would benefit from a deeper understanding of the patterns associated with merged PRs, or from sentiment analysis of closed PRs. Such metrics could also be used to reward top contributors, identify candidates for new maintainers and spot project bottlenecks before they arise.
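As a minimal sketch of the time-to-merge and time-to-close metrics mentioned above, the Python snippet below averages the lifetime of a handful of pull requests. The sample records are hypothetical, and the field names (created_at, closed_at, merged_at) are assumed to follow the timestamp fields that hosting platforms such as GitHub expose for pull requests; a real analysis would load such records from an API or a dataset rather than hard-code them.

```python
from datetime import datetime
from statistics import mean

# Hypothetical sample data for illustration only; not taken from a real project.
PULL_REQUESTS = [
    {"number": 1, "created_at": "2019-01-01T00:00:00Z",
     "closed_at": "2019-01-03T00:00:00Z", "merged_at": "2019-01-03T00:00:00Z"},
    {"number": 2, "created_at": "2019-01-02T00:00:00Z",
     "closed_at": "2019-01-03T12:00:00Z", "merged_at": None},  # closed, not merged
    {"number": 3, "created_at": "2019-01-05T00:00:00Z",
     "closed_at": "2019-01-09T00:00:00Z", "merged_at": "2019-01-09T00:00:00Z"},
    {"number": 4, "created_at": "2019-01-06T00:00:00Z",
     "closed_at": None, "merged_at": None},  # still open
]


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as 2019-01-01T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def average_hours(pull_requests, end_field):
    """Average hours between creation and end_field, skipping PRs without it."""
    durations = [
        (parse_ts(pr[end_field]) - parse_ts(pr["created_at"])).total_seconds() / 3600
        for pr in pull_requests
        if pr.get(end_field)
    ]
    return mean(durations) if durations else None


avg_merge_hours = average_hours(PULL_REQUESTS, "merged_at")
avg_close_hours = average_hours(PULL_REQUESTS, "closed_at")
print(f"Average time to merge: {avg_merge_hours:.1f} h")
print(f"Average time to close: {avg_close_hours:.1f} h")
```

An average is only a starting point; a median or a percentile is often more robust when a few long-lived PRs skew the distribution.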

Embracing the Concept of ‘Code as Data’

Beyond the quality and quantity of OSS data and source code datasets, the biggest challenge to effective Open Source data analysis might be the lack of established frameworks and methodologies. OSS project maintainers might wonder about the volume, frequency, origin and significance of external contributions, as well as the overall contributor experience, but they don’t necessarily know which metrics to track or how to track them effectively. Although organizations such as the Community Health Analytics Open Source Software (CHAOSS) project are currently tackling this problem, this is still a fairly new and underdeveloped discipline.

The concept of “Code as Data,” which consists of treating source code as an analyzable dataset that yields valuable business insights, is starting to become popular with large multinationals. Enterprises are starting to realize its benefits but often underestimate some of its challenges. First, a lack of proper documentation, or a failure to comply with engineering rules and processes (known as “shadow IT”), prevents IT leaders from knowing and measuring what’s in their codebases. Second, the variety of programming languages, frameworks and versions, together with the technical debt accumulated over time, makes large-scale code analysis not only complex but also time-consuming.
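To make the idea of treating source code as a dataset more concrete, here is a small illustrative sketch in Python. It uses the standard library ast module to parse a snippet of Python source and turn every function definition into a row that can be filtered or aggregated like any other record. The sample source and the chosen columns (name, args, lines) are assumptions made for illustration; this is not a description of any particular vendor's tooling, and it only handles a single language.

```python
import ast

# Hypothetical source code to analyze; in practice this would be read from files.
SAMPLE_SOURCE = '''
def add(a, b):
    return a + b

def greet(name):
    message = "Hello, " + name
    return message

class Counter:
    def increment(self):
        self.value = getattr(self, "value", 0) + 1
'''


def functions_as_rows(source):
    """Return one dict per function definition found in the source text."""
    tree = ast.parse(source)
    rows = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            rows.append({
                "name": node.name,
                "args": len(node.args.args),  # positional parameters only
                "lines": node.end_lineno - node.lineno + 1,  # needs Python 3.8+
            })
    return sorted(rows, key=lambda row: row["name"])


rows = functions_as_rows(SAMPLE_SOURCE)
for row in rows:
    print(row)
```

Once code is represented as rows like these, questions such as "which functions are the longest?" become ordinary queries. Doing the same across many languages and the full history of many repositories is where the challenges described above begin.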

At the industry level, everyone would benefit from applying proven methodologies such as Master Data Management (MDM) to their codebases. It would help define standard processes for source code retrieval and analysis that would unlock industry-wide opportunities. MDM for source code could be the Open Source foundation the industry needs to prevent vendor lock-in while allowing companies to build proprietary platforms or applications on top of it.

Investing in Machine Learning on Code Tooling

According to a recent Stripe report, the “economic impact of developers dealing with bad code, debugging, refactoring, etc.” adds up to a shocking $85 billion annually.

In light of this number, improving code quality through metrics should be top of mind for companies as they try to modernize their codebases and software development practices. Yet there seems to be no easy way for companies to retrieve, parse and query source code and its history over time. The number of lines of code keeps growing at an exponential rate, but developer tools have not evolved rapidly enough to help developers deal with its intricacy and variety.

Many data scientists and developers agree that Machine Learning is the answer to this large-scale code and data analysis challenge. Machine Learning algorithms could be used to improve the way we write, read and maintain our codebases. The notion of Machine Learning on Code (ML on Code) is gaining a lot of interest in the software development community. There are a few blog posts introducing Machine Learning on Code principles and the associated research fields, but the tools are still in the early stages of development. In other words, this is an exciting challenge and collaboration opportunity for the Open Source community.


