359,569 commits with source code density; 1149 commits of which have software maintenance activity labels (adaptive, corrective, perfective)
Linnaeus University, Faculty of Technology, Department of computer science and media technology (CM). (DISA;DSIQ;DISTA)
ORCID iD: 0000-0001-7937-1645
2019 (English) Data set
Physical description [en]

This dataset comes as an SQL-importable file and is compatible with the widely available MariaDB and MySQL databases.

It is based on (and incorporates/extends) the dataset "1151 commits with software maintenance activity labels (corrective,perfective,adaptive)" by Levin and Yehudai (https://doi.org/10.5281/zenodo.835534).

The extensions to this dataset were obtained using Git-Tools, a tool that is included in the Git-Density (https://doi.org/10.5281/zenodo.2565238) suite. For each of the projects in the original dataset, Git-Tools was run in extended mode.

Place, publisher, year
2019.
Version
1.0
Keywords [en]
Software Maintenance, Software Evolution, Mining Software Repositories, Predictive Modeling, Software Quality
National Category
Software Engineering Computer Sciences
Research subject
Computer Science, Software Technology
Identifiers
URN: urn:nbn:se:lnu:diva-98175
DOI: 10.5281/zenodo.2590518
OAI: oai:DiVA.org:lnu-98175
DiVA, id: diva2:1470562
Note

The dataset contains these tables:

- x1151:

  - The original dataset from Levin and Yehudai.

  - Despite its name, this dataset has only 1,149 commits, as two commits were duplicates in the original dataset.

  - This dataset spans 11 projects, each of which has between 99 and 114 commits. It has 71 features and covers the projects RxJava, hbase, elasticsearch, intellij-community, hadoop, drools, Kotlin, restlet-framework-java, orientdb, camel and spring-framework.

- gtools_ex (short for Git-Tools, extended):

  - Contains 359,569 commits, analyzed using Git-Tools in extended mode.

  - It spans all commits and projects from the x1151 dataset as well.

  - All 11 projects were analyzed, from the initial commit until the end of January 2019. For the projects Intellij and Kotlin, only the first 35,000 and 30,000 commits, respectively, were analyzed.

  - This dataset introduces 35 new features (see list below), 22 of which are size- or density-related.

The dataset contains these views:

- geX_L (short for Git-Tools, extended, with labels): Joins the commits' labels from x1151 with the extended attributes from gtools_ex, using the commits' hashes.

- jeX_L (short for joined, extended, with labels): Joins the datasets x1151 and gtools_ex entirely, based on the commits' hashes.
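
The two views can be pictured as joins on the commit hash. The following is a minimal sketch using Python's built-in sqlite3 module in place of MariaDB/MySQL. The x1151 column names (commitId, label, project), the toy rows and the choice of an inner join are assumptions made for illustration; they are not taken from the dataset's actual view definitions.

```python
import sqlite3

# In-memory SQLite stands in for MariaDB/MySQL here.
# Hypothetical: the x1151 columns and all rows below are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE x1151 (commitId TEXT PRIMARY KEY, label TEXT, project TEXT);
CREATE TABLE gtools_ex (SHA1 TEXT PRIMARY KEY, Density REAL);

INSERT INTO x1151 VALUES
  ('aaa', 'corrective', 'RxJava'),
  ('bbb', 'perfective', 'camel');
INSERT INTO gtools_ex VALUES ('aaa', 0.5), ('bbb', 1.0), ('ccc', 0.0);

-- Only the label from x1151, plus every extended attribute.
CREATE VIEW geX_L AS
  SELECT g.*, x.label
  FROM gtools_ex g JOIN x1151 x ON x.commitId = g.SHA1;

-- Every column of both tables.
CREATE VIEW jeX_L AS
  SELECT *
  FROM x1151 x JOIN gtools_ex g ON x.commitId = g.SHA1;
""")

# Labeled commits with one extended attribute; the unlabeled commit 'ccc'
# drops out because of the (assumed) inner join.
labeled = con.execute(
    "SELECT SHA1, label, Density FROM geX_L ORDER BY SHA1"
).fetchall()

# Column names exposed by the fully joined view.
joined_columns = [d[0] for d in con.execute("SELECT * FROM jeX_L").description]
```

In this sketch, geX_L is the narrower view for training a classifier on the extended attributes alone, while jeX_L also exposes the original x1151 features.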

 

Features of the gtools_ex dataset:

- SHA1

- RepoPathOrUrl

- AuthorName

- CommitterName

- AuthorTime (UTC)

- CommitterTime (UTC)

- MinutesSincePreviousCommit: A double describing the number of minutes that passed since the previous commit. "Previous" refers to the parent commit, not the chronologically preceding commit.

- Message: The commit's message/comment

- AuthorEmail

- CommitterEmail

- AuthorNominalLabel: All authors of a repository are analyzed and merged by Git-Density using a heuristic, even if they do not always use the same email address or name. This label is a unique string that helps identify the same author across commits, even if the author did not always use the exact same identity.

- CommitterNominalLabel: The same as AuthorNominalLabel, but for the committer this time.

- IsInitialCommit: A boolean indicating whether a commit has no parent, i.e. is not preceded by another commit.

- IsMergeCommit: A boolean indicating whether a commit has more than one parent.

- NumberOfParentCommits

- ParentCommitSHA1s: A comma-concatenated string of the parents' SHA1 IDs

- NumberOfFilesAdded

- NumberOfFilesAddedNet: Like the previous property, but if the net-size of all changes of an added file is zero (i.e. when adding a file that is empty/whitespace or does not contain code), then this property does not count the file.

- NumberOfLinesAddedByAddedFiles

- NumberOfLinesAddedByAddedFilesNet: Like the previous property, but counts the net-lines

- NumberOfFilesDeleted

- NumberOfFilesDeletedNet: Like the previous property, but considers only files that had net-changes

- NumberOfLinesDeletedByDeletedFiles

- NumberOfLinesDeletedByDeletedFilesNet: Like the previous property, but counts the net-lines

- NumberOfFilesModified

- NumberOfFilesModifiedNet: Like the previous property, but considers only files that had net-changes

- NumberOfFilesRenamed

- NumberOfFilesRenamedNet: Like the previous property, but considers only files that had net-changes

- NumberOfLinesAddedByModifiedFiles

- NumberOfLinesAddedByModifiedFilesNet: Like the previous property, but counts the net-lines

- NumberOfLinesDeletedByModifiedFiles

- NumberOfLinesDeletedByModifiedFilesNet: Like the previous property, but counts the net-lines

- NumberOfLinesAddedByRenamedFiles

- NumberOfLinesAddedByRenamedFilesNet: Like the previous property, but counts the net-lines

- NumberOfLinesDeletedByRenamedFiles

- NumberOfLinesDeletedByRenamedFilesNet: Like the previous property, but counts the net-lines

- Density: The ratio between the sum of all net lines (added+deleted+modified+renamed) and the sum of their respective gross versions. A density of zero means that the sum of net lines is zero (i.e. all line changes were just whitespace, comments etc.). A density of 1 means that all changed net lines contribute to the gross size of the commit (i.e. no useless lines with e.g. only comments or whitespace).

- AffectedFilesRatioNet: The ratio between the sums of NumberOfFilesXXX and NumberOfFilesXXXNet
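
The two ratio features can be recomputed from the line and file counts listed above. The sketch below is an illustration under stated assumptions, not the Git-Tools implementation: it reads Density as net lines divided by gross lines (consistent with the description of the values 0 and 1), assumes the same net/gross direction for AffectedFilesRatioNet, and assumes a result of 0.0 when the gross sum is zero. The helper names and the example commit are hypothetical.

```python
from typing import Iterable, Mapping

# Gross features from the list above; each has a counterpart ending in "Net".
GROSS_LINE_FEATURES = (
    "NumberOfLinesAddedByAddedFiles",
    "NumberOfLinesDeletedByDeletedFiles",
    "NumberOfLinesAddedByModifiedFiles",
    "NumberOfLinesDeletedByModifiedFiles",
    "NumberOfLinesAddedByRenamedFiles",
    "NumberOfLinesDeletedByRenamedFiles",
)
GROSS_FILE_FEATURES = (
    "NumberOfFilesAdded",
    "NumberOfFilesDeleted",
    "NumberOfFilesModified",
    "NumberOfFilesRenamed",
)


def net_to_gross_ratio(commit: Mapping[str, float], gross: Iterable[str]) -> float:
    """Sum of the "...Net" features divided by the sum of the gross features."""
    names = tuple(gross)
    gross_sum = sum(commit[name] for name in names)
    net_sum = sum(commit[name + "Net"] for name in names)
    # Assumption: a commit without any gross changes gets a ratio of 0.0.
    return net_sum / gross_sum if gross_sum else 0.0


def density(commit: Mapping[str, float]) -> float:
    return net_to_gross_ratio(commit, GROSS_LINE_FEATURES)


def affected_files_ratio_net(commit: Mapping[str, float]) -> float:
    # Assumption: same net/gross direction as Density.
    return net_to_gross_ratio(commit, GROSS_FILE_FEATURES)


# Hypothetical commit: two modified files, one of which changed only
# comments or whitespace; 16 gross changed lines, 8 of them net.
example = {
    name: 0
    for gross_name in GROSS_LINE_FEATURES + GROSS_FILE_FEATURES
    for name in (gross_name, gross_name + "Net")
}
example.update({
    "NumberOfLinesAddedByModifiedFiles": 12,
    "NumberOfLinesAddedByModifiedFilesNet": 6,
    "NumberOfLinesDeletedByModifiedFiles": 4,
    "NumberOfLinesDeletedByModifiedFilesNet": 2,
    "NumberOfFilesModified": 2,
    "NumberOfFilesModifiedNet": 1,
})

example_density = density(example)                      # (6 + 2) / (12 + 4)
example_file_ratio = affected_files_ratio_net(example)  # 1 / 2
```

Values recomputed this way may differ from the stored columns if Git-Tools handles edge cases (such as empty commits) differently.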

 

This dataset supports the paper "Importance and Aptitude of Source code Density for Commit Classification into Maintenance Activities", as submitted to the QRS2019 conference (The 19th IEEE International Conference on Software Quality, Reliability, and Security). Citation: Hönel, S., Ericsson, M., Löwe, W. and Wingkvist, A., 2019. Importance and Aptitude of Source code Density for Commit Classification into Maintenance Activities. In The 19th IEEE International Conference on Software Quality, Reliability, and Security.

The dataset was compressed using 7z and the PPMd algorithm.

Available from: 2020-09-25 Created: 2020-09-25 Last updated: 2024-01-22 Bibliographically approved
In thesis
1. Quantifying Process Quality: The Role of Effective Organizational Learning in Software Evolution
2023 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Real-world software applications must constantly evolve to remain relevant. This evolution occurs when developing new applications or adapting existing ones to meet new requirements, make corrections, or incorporate future functionality. Traditional methods of software quality control involve software quality models and continuous code inspection tools. These measures focus on directly assessing the quality of the software. However, there is a strong correlation and causation between the quality of the development process and the resulting software product. Therefore, improving the development process indirectly improves the software product, too. To achieve this, effective learning from past processes is necessary, often embraced through post mortem organizational learning. While qualitative evaluation of large artifacts is common, smaller quantitative changes captured by application lifecycle management are often overlooked. In addition to software metrics, these smaller changes can reveal complex phenomena related to project culture and management. Leveraging these changes can help detect and address such complex issues.

Software evolution was previously measured by the size of changes, but the lack of consensus on a reliable and versatile quantification method prevents its use as a dependable metric. Different size classifications fail to reliably describe the nature of evolution. While application lifecycle management data is rich, identifying which artifacts can model detrimental managerial practices remains uncertain. Approaches such as simulation modeling, discrete events simulation, or Bayesian networks have only limited ability to exploit continuous-time process models of such phenomena. Even worse, the accessibility and mechanistic insight into such gray- or black-box models are typically very low. To address these challenges, we suggest leveraging objectively captured digital artifacts from application lifecycle management, combined with qualitative analysis, for efficient organizational learning. A new language-independent metric is proposed to robustly capture the size of changes, significantly improving the accuracy of change nature determination. The classified changes are then used to explore, visualize, and suggest maintenance activities, enabling solid prediction of malpractice presence and -severity, even with limited data. Finally, parts of the automatic quantitative analysis are made accessible, potentially replacing expert-based qualitative analysis in parts.

Place, publisher, year, edition, pages
Växjö: Linnaeus University Press, 2023
Series
Linnaeus University Dissertations ; 504
Keywords
Software Size, Software Metrics, Commit Classification, Maintenance Activities, Software Quality, Process Quality, Project Management, Organizational Learning, Machine Learning, Visualization, Optimization
National Category
Computer and Information Sciences Software Engineering Mathematical Analysis Probability Theory and Statistics
Research subject
Computer Science, Software Technology; Computer Science, Information and software visualization; Computer and Information Sciences Computer Science, Computer Science; Statistics/Econometrics
Identifiers
urn:nbn:se:lnu:diva-124916 (URN)
10.15626/LUD.504.2023 (DOI)
9789180820738 (ISBN)
9789180820745 (ISBN)
Public defence
2023-09-29, House D, D1136A, 351 95 Växjö, Växjö, 13:00 (English)
Available from: 2023-09-28 Created: 2023-09-27 Last updated: 2024-05-06 Bibliographically approved

Authority records

Hönel, Sebastian

Hönel, S., Ericsson, M., Löwe, W. & Wingkvist, A. (2019). Importance and Aptitude of Source code Density for Commit Classification into Maintenance Activities. In: Dr. David Shepherd (Ed.), 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS). Paper presented at The 19th IEEE International Conference on Software Quality, Reliability, and Security, July 22-26, 2019, Sofia, Bulgaria (pp. 109-120). IEEE
