Daniel Morales German (University of Victoria)

“When history matters: using Software Repositories to address Source Code Provenance”


Abstract: Who owns the copyright of a project? The answer determines who can license the code, either commercially or as open source, and it is also an important question during business acquisitions. Unfortunately, this is not always an easy question to answer. On one hand, copying code is easy, and tools such as editors and version control systems do not trace such copying or its source. On the other hand, software development, especially in open source, is increasingly a team effort. In the absence of contributor copyright assignments (which transfer the ownership of a contribution to the project), the ownership of the source code becomes difficult to assess. Even those who reuse open source need to be concerned that the software they are reusing is properly licensed. To answer the question of copyright ownership, one first needs to answer the question: "what is the provenance of this code?" In this lecture I will describe the challenges of provenance discovery, including the discovery of reliable corpora, the use of Bertillonage and clone detection to identify copied code, and the analysis of development history to assess who the copyright authors of a system are. I will also describe the challenges that copyright law imposes on legally defining how software modifications contribute (or not) to the overall copyright of a system. Finally, I will give an overview of the research we have performed in recent years on provenance discovery, and how we have used software repositories to do it.
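Clone detection for provenance typically compares fingerprints of code fragments rather than raw text. The following is a minimal, illustrative sketch of token n-gram fingerprinting with Jaccard similarity; it is a toy stand-in, not the Bertillonage technique discussed in the lecture, and the fragments are invented examples.

```python
import hashlib

def fingerprints(code, n=5):
    """Hash every n-token window of a code fragment.
    A toy stand-in for signature-based clone detection."""
    tokens = code.split()
    return {hashlib.md5(" ".join(tokens[i:i + n]).encode()).hexdigest()
            for i in range(max(1, len(tokens) - n + 1))}

def similarity(a, b, n=5):
    """Jaccard similarity between two fragments' fingerprint sets."""
    fa, fb = fingerprints(a, n), fingerprints(b, n)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

# Two fragments, one copied verbatim from the other (illustrative data).
original = "int add(int a, int b) { return a + b; }"
copied   = "int add(int a, int b) { return a + b; }"
unrelated = "def mul(x, y): return x * y"
```

In a real provenance study, such fingerprints would be computed over a large corpus so that a suspect fragment can be matched against candidate origins.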


Ahmed E. Hassan (Queen’s University)

“Mining Software Repositories: Accomplishments and Challenges”


Abstract: Software engineering data (such as code bases, execution traces, historical code changes, mailing lists, and bug databases) contains a wealth of information about a project's status, progress, and evolution. Using well-established data mining techniques, practitioners and researchers can explore the potential of this valuable data in order to better manage their projects and to produce higher-quality software systems that are delivered on time and within budget. This lecture will present the latest research in mining Software Engineering (SE) data, discuss the challenges associated with mining SE data, highlight SE data mining success stories, and outline future research directions. Attendees will acquire the knowledge and skills needed to perform research or conduct practice in the field and to integrate data mining techniques into their own research or practice. A hands-on illustration of commonly used analysis tools such as R and WEKA will also be provided.
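A classic example of mining SE data is ranking files by how often their changes fix bugs, which flags likely defect-prone code for extra review. The sketch below uses an invented toy change log and a simple fix-ratio heuristic; it illustrates the idea only, not the specific models covered in the lecture.

```python
from collections import Counter

# Toy change log mined from version control: (file, is_bug_fix) pairs.
# The file names and values are illustrative, not real project data.
changes = [
    ("core/parser.c", True), ("core/parser.c", False),
    ("core/parser.c", True), ("util/log.c", False),
    ("ui/window.c", False), ("core/parser.c", True),
]

churn = Counter(f for f, _ in changes)           # total changes per file
fixes = Counter(f for f, bug in changes if bug)  # bug-fixing changes per file

# Heuristic: files with a high fix ratio are likely defect-prone.
risk = {f: fixes[f] / churn[f] for f in churn}
ranked = sorted(risk, key=risk.get, reverse=True)
```

The same ratios could feed a regression or classification model in R or WEKA, which is where the hands-on part of the lecture picks up.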


Shane McIntosh (McGill University)

“Building on an unsound foundation: How release pipelines can impact our predictive models”


Abstract: Mining Software Repositories (MSR) researchers use complex statistical regression models and machine learning techniques to understand software engineering phenomena. We apply MSR techniques to analyze historical data that is stored in software repositories. As the MSR field has matured and MSR techniques have become more robust, the size of our studied datasets (in terms of number of projects) has grown. While this growth addresses natural external validity concerns, it increases internal validity risks.
In this lecture, I will discuss the importance of understanding the release pipeline of our studied projects. I will elaborate on how naive treatment of files, releases, and branches can lead to noise and biases that threaten the validity of MSR analyses. Furthermore, I will provide a framework for how release pipeline biases can be addressed. A hands-on component showing how to extract and leverage release data from repositories will also be provided.
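One such bias is easy to see in miniature: if every commit made before a release date is naively counted as part of that release, commits on maintenance branches get misattributed. The sketch below uses invented commit data and branch names to contrast the naive and branch-aware mappings; it is an illustration of the pitfall, not the framework presented in the lecture.

```python
from datetime import date

# Toy commit history with branch labels (all data is illustrative).
commits = [
    {"id": "a1", "branch": "main",        "date": date(2016, 1, 10)},
    {"id": "b2", "branch": "main",        "date": date(2016, 2, 2)},
    {"id": "c3", "branch": "release-1.x", "date": date(2016, 2, 20)},  # backport
    {"id": "d4", "branch": "main",        "date": date(2016, 3, 1)},
]
release_2_0 = date(2016, 3, 15)  # hypothetical mainline release date

# Naive: everything committed before the release date is "in" release 2.0.
naive = [c["id"] for c in commits if c["date"] <= release_2_0]

# Branch-aware: only commits on the branch that feeds the release count.
aware = [c["id"] for c in commits
         if c["date"] <= release_2_0 and c["branch"] == "main"]
```

The backport commit `c3` never ships in release 2.0, yet the naive mapping includes it; at scale, such noise biases any predictive model trained on the release labels.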

Alberto Bacchelli (Delft University of Technology)

“Supporting the human aspects of software engineering”



Abstract: Software development leads to the creation of large amounts of data, such as source code changes, defects, and test executions. Software Analytics aims at uncovering patterns and actionable insights from this data to support the human aspects of software development and maintenance. Selecting the right data is key to the success of Software Analytics. Unstructured software data (e.g., emails, bug descriptions, and technical forum discussions) is a valuable form of data because it opens a unique view of the human factors involved in a software project; yet it is hard to harness. In the first part of the lecture, I will introduce how automated techniques based on text search, machine learning, and island parsing can be used to mine this data and obtain actionable results. Data alone is not enough: it has to be analyzed to answer the right questions and tackle relevant developers' needs. In the second part of the lecture, I will introduce how qualitative research methods can be used to uncover developers' needs. In particular, as an example, I will describe how we uncovered the motivations, real outcomes, and fundamental challenges of Modern Code Review, opening a very promising research line to be tackled with Software Analytics. I will also present a few analyses that can be done to support modern code review with data.
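Island parsing treats most of a document as uninteresting "water" and recognizes only the "islands" of interest, such as code fragments embedded in an email. The following is a deliberately simplified regex-based sketch of that idea; the pattern and the sample email are illustrative and much cruder than the island grammars used in the lecture's research.

```python
import re

# Toy island recognizer: a Java-like method signature followed by "{".
# This pattern is a simplified illustration, not a full island grammar.
ISLAND = re.compile(r"\b(?:\w+\s+)+\w+\s*\([^)]*\)\s*\{")

email = """Hi all, the crash happens inside:
public void save(File f) {
as soon as the file is locked. Any ideas? Thanks!"""

# Everything that is not an island (the natural-language "water") is skipped.
islands = ISLAND.findall(email)
```

Once extracted, such islands can be linked back to the code base, turning free-form discussions into structured, analyzable data.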