2019 (English) Data set
Keywords
Software Maintenance, Software Evolution, Mining Software Repositories, Predictive Modeling, Software Quality
National Category
Software Engineering Computer Sciences
Research subject
Computer Science, Software Technology
Identifiers
urn:nbn:se:lnu:diva-98175 (URN)
10.5281/zenodo.2590518 (DOI)
Note
The dataset contains these tables:
- x1151:
  - The original dataset from Levin and Yehudai.
  - Despite its name, this dataset has only 1,149 commits, as two commits were duplicates in the original dataset.
  - This dataset spans 11 projects, each of which has between 99 and 114 commits.
  - This dataset has 71 features and spans the projects RxJava, hbase, elasticsearch, intellij-community, hadoop, drools, Kotlin, restlet-framework-java, orientdb, camel and spring-framework.
- gtools_ex (short for Git-Tools, extended):
  - Contains 359,569 commits, analyzed using Git-Tools in extended mode.
  - It spans all commits and projects from the x1151 dataset as well.
  - All 11 projects were analyzed, from the initial commit until the end of January 2019. For the projects Intellij and Kotlin, the first 35,000 and 30,000 commits were analyzed, respectively.
  - This dataset introduces 35 new features (see list below), 22 of which are size- or density-related.
The dataset contains these views:
- geX_L (short for Git-tools, extended, with labels): Joins the commits' labels from x1151 with the extended attributes from gtools_ex, using the commits' hashes.
- jeX_L (short for joined, extended, with labels): Joins the datasets x1151 and gtools_ex entirely, based on the commits' hashes.
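The join behind a view such as geX_L can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the column names `commitId` and `label` in x1151, the reduced two-column schemas, the sample rows, and the choice of an inner join are all hypothetical and not taken from the dataset itself.

```python
import sqlite3

# In-memory database with heavily reduced, hypothetical schemas.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed column names; the real tables have many more columns.
    CREATE TABLE x1151 (commitId TEXT PRIMARY KEY, label TEXT);
    CREATE TABLE gtools_ex (SHA1 TEXT PRIMARY KEY, Density REAL);

    -- Made-up sample rows; 'ccc' has no label in x1151.
    INSERT INTO x1151 VALUES ('aaa', 'c'), ('bbb', 'p');
    INSERT INTO gtools_ex VALUES ('aaa', 0.8), ('bbb', 0.5), ('ccc', 1.0);

    -- Attach the labels from x1151 to the extended attributes of
    -- gtools_ex by matching commit hashes (inner join is an assumption).
    CREATE VIEW geX_L AS
        SELECT g.*, x.label
        FROM gtools_ex AS g
        INNER JOIN x1151 AS x ON x.commitId = g.SHA1;
""")

rows = conn.execute(
    "SELECT SHA1, Density, label FROM geX_L ORDER BY SHA1"
).fetchall()
print(rows)  # only the two labeled commits remain
```

A view like jeX_L would presumably select all columns of both tables instead of only the label.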
Features of the gtools_ex dataset:
- SHA1
- RepoPathOrUrl
- AuthorName
- CommitterName
- AuthorTime (UTC)
- CommitterTime (UTC)
- MinutesSincePreviousCommit: Double, describing the number of minutes that passed since the previous commit. "Previous" refers to the parent commit, not the chronologically preceding commit.
- Message: The commit's message/comment
- AuthorEmail
- CommitterEmail
- AuthorNominalLabel: All authors of a repository are analyzed and merged by Git-Density using some heuristic, even if they do not always use the same email address or name. This label is a unique string that helps identify the same author across commits, even if the author did not always use the exact same identity.
- CommitterNominalLabel: The same as AuthorNominalLabel, but for the committer this time.
- IsInitialCommit: A boolean indicating whether a commit is preceded by a parent or not.
- IsMergeCommit: A boolean indicating whether a commit has more than one parent.
- NumberOfParentCommits
- ParentCommitSHA1s: A comma-concatenated string of the parents' SHA1 IDs
- NumberOfFilesAdded
- NumberOfFilesAddedNet: Like the previous property, but if the net-size of all changes of an added file is zero (i.e. when adding a file that is empty/whitespace or does not contain code), then this property does not count the file.
- NumberOfLinesAddedByAddedFiles
- NumberOfLinesAddedByAddedFilesNet: Like the previous property, but counts the net-lines
- NumberOfFilesDeleted
- NumberOfFilesDeletedNet: Like the previous property, but considers only files that had net-changes
- NumberOfLinesDeletedByDeletedFiles
- NumberOfLinesDeletedByDeletedFilesNet: Like the previous property, but counts the net-lines
- NumberOfFilesModified
- NumberOfFilesModifiedNet: Like the previous property, but considers only files that had net-changes
- NumberOfFilesRenamed
- NumberOfFilesRenamedNet: Like the previous property, but considers only files that had net-changes
- NumberOfLinesAddedByModifiedFiles
- NumberOfLinesAddedByModifiedFilesNet: Like the previous property, but counts the net-lines
- NumberOfLinesDeletedByModifiedFiles
- NumberOfLinesDeletedByModifiedFilesNet: Like the previous property, but counts the net-lines
- NumberOfLinesAddedByRenamedFiles
- NumberOfLinesAddedByRenamedFilesNet: Like the previous property, but counts the net-lines
- NumberOfLinesDeletedByRenamedFiles
- NumberOfLinesDeletedByRenamedFilesNet: Like the previous property, but counts the net-lines
- Density: The ratio between the sum of all net lines (added+deleted+modified+renamed) and the sum of their respective gross versions. A density of zero means that the sum of net-lines is zero (i.e. all line changes were just whitespace, comments etc.). A density of 1 means that all changed net-lines contribute to the gross-size of the commit (i.e. no useless lines with e.g. only comments or whitespace).
- AffectedFilesRatioNet: The ratio between the sums of NumberOfFilesXXX and NumberOfFilesXXXNet
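As a rough illustration of the Density definition above, the following sketch sums the six net line counts and divides by the sum of their gross counterparts. The helper name `density`, the sample values, and the choice to return 0.0 when the gross sum is zero are assumptions made for this example; the actual computation in Git-Tools may differ.

```python
# Gross line-count features from the list above; each has a "...Net" twin.
GROSS_LINE_FIELDS = [
    "NumberOfLinesAddedByAddedFiles",
    "NumberOfLinesDeletedByDeletedFiles",
    "NumberOfLinesAddedByModifiedFiles",
    "NumberOfLinesDeletedByModifiedFiles",
    "NumberOfLinesAddedByRenamedFiles",
    "NumberOfLinesDeletedByRenamedFiles",
]


def density(commit: dict) -> float:
    """Ratio of summed net lines to summed gross lines (hypothetical helper)."""
    gross = sum(commit[name] for name in GROSS_LINE_FIELDS)
    net = sum(commit[name + "Net"] for name in GROSS_LINE_FIELDS)
    # Assumption: a commit without any gross line changes gets density 0.0.
    return net / gross if gross else 0.0


# Made-up commit: 40 gross lines changed, 30 of them net lines.
example = {name: 0 for name in GROSS_LINE_FIELDS}
example.update({name + "Net": 0 for name in GROSS_LINE_FIELDS})
example.update({
    "NumberOfLinesAddedByAddedFiles": 10,
    "NumberOfLinesAddedByAddedFilesNet": 8,
    "NumberOfLinesAddedByModifiedFiles": 20,
    "NumberOfLinesAddedByModifiedFilesNet": 12,
    "NumberOfLinesDeletedByModifiedFiles": 10,
    "NumberOfLinesDeletedByModifiedFilesNet": 10,
})

print(density(example))  # 0.75
```

AffectedFilesRatioNet could presumably be computed in the same way over the file counts, although the note does not spell out which sum is the numerator.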
This dataset supports the paper "Importance and Aptitude of Source code Density for Commit Classification into Maintenance Activities", as submitted to the QRS2019 conference (The 19th IEEE International Conference on Software Quality, Reliability, and Security). Citation: Hönel, S., Ericsson, M., Löwe, W. and Wingkvist, A., 2019. Importance and Aptitude of Source code Density for Commit Classification into Maintenance Activities. In The 19th IEEE International Conference on Software Quality, Reliability, and Security.
The dataset was compressed using 7z and the PPMd algorithm.
2020-09-25 2020-09-25 2024-01-22 Bibliographically approved