This post was inspired from seeing multiple bugs coming from the same file in a project in a short period of time. I could easily see all types of code smells in this particular file. So I had a hypothesis, if I could find the most edited files in the repo I would likely find the files with the most technical debt i.e. most code smells. So I created a very simple powershell script to test my hypothesis.
Skip to the solution section below if you just want to see code.
Why do the most edited files matter ?
Why do files change? In most cases files change for two reasons, there is a bug or there are changes that are needed to support a new feature. Lots of changes in a file due to bugs is obviously a good reason to take a closer look at a file.
Lots of changes to a file in order to support new features is also a good reason to take a closer look at a file since it could mean that the rest of your project may be too tightly coupled to things in this file or maybe the file has too many responsibilities.
Different approaches to the problem
The solution below can easily be converted into some other scripting language if needed but since powershell is cross-platform you should be able to run this on any some what modern computer. Tested both on windows and OSX
Solution
This is the pretty script version you can just copy paste this into a .ps1
file and run it.
- Notice that the last command argument
-First 10
can be changed to control the top results you want displayed. - Add a
-#
to the git command to control how many commits you want to run the script on. For examplegit log --pretty=format: --name-only -50
will only run the analysis on the last 50 commits.
1
2
3
4
5
git log --pretty=format: --name-only |
Where-Object { ![string]::IsNullOrEmpty($_) } |
Group-Object |
Sort-Object -Property Count -Descending |
Select-Object -Property Count, Name -First 10
Here is a shortened dirty one-liner version
1
git log --pretty=format: --name-only | ? { ![string]::IsNullOrEmpty($_) } | group | Sort-Object -Property Count -Descending | select -Property Count, Name -First 10
Code breakdown
Using some built-in powershell cmdlets and pipelining we transform some git text output into some useful statistics about a repository.
git log --pretty=format: --name-only
gives you the list of files changed per commit.Where-Object { ![string]::IsNullOrEmpty($_) }
filters out all empty lines that are used by the git output from the first command to separate commits.Group-Object
groups all the lines that are the same together and gives you the number of objects in each group.Sort-Object -Property Count -Descending
sorts the groups by thecount
which is number of same files from highest to lowest.Select-Object -Property Count, Name -First 10
selects the first 10 objects and displays theCount
andName
properties. (The properties are from the objects thatGroup-Object
has passed down the pipeline)
Some notes on the “Dirty one-liner”
This one-liner is essentially the same thing as the pretty version except that it is all on one-line and uses the aliases for Where-Object
, Group-Object
, and Select-Object
.
Example output
Output after running the code should be something like this :
Count Name
----- ----
33 _config.yml
16 _layouts/post.html
15 README.md
13 _includes/sidebar.html
Room for improvement
- Instead of just sorting by the number of times a file has changed sort by the number of lines in aggregate that has changed in a file.
- Maybe as a part 2 complementary type script, have a way to determine the most changed section of code maybe even narrow it down to functions or class names.
- Filter out certain projects in a repository to target analysis to a specific project.
- Filter out non bug-fix changes.
- Filter out bug-fix changes.