An adventure with a super useless one-liner to find the most common words in WordPress commit messages

I read some insight into Drupal committing, and it included a chart of the most common words in Drupal commit messages. I thought it would be interesting to do something like that with WordPress Core, so I threw together a bash one-liner to find out. It’s not the most elegant solution, but it answers the question that I had. Here is what I initially came up with.

svn log -rHEAD:1 -v --xml | xq '.log.logentry | .[].msg' | sed 's/.$//' | sed 's/^.//' | sed 's/\\n/ /g' | tr ' \t' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 25

Let’s walk through this, since there is enough piping going on that it may not be the easiest to follow.

svn log -rHEAD:1 -v --xml

I start by getting an XML version of the SVN history, starting at the current HEAD and going back to the first changeset.

xq '.log.logentry | .[].msg'

Next, I use xq, which takes XML and allows me to run jq commands on it. It’s a handy tool if you ever need to use XML data on the command line. In this case, I am taking what is inside <log><logentry> and then, for each sub-element, extracting the msg. At this point, each message is on a single line, wrapped in quotation marks, with \n to signify newlines. So I run three seds to fix that up.

 sed 's/.$//' | sed 's/^.//' | sed 's/\\n/ /g'

I’m sure there is a better way to do this, but the first one removes the last character, the next one removes the first character, and the last one converts the escaped newlines (\n) to spaces. Since words are what we are aiming to look at, we need to get all the words onto their own lines.
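The three passes can be tried on a single made-up message without touching SVN. This is a minimal sketch, assuming the messages arrive as quoted strings containing the two-character sequence backslash-n rather than real newlines; the sample text is hypothetical. It also folds the three substitutions into one sed call with -e.

```shell
# Hypothetical sample: one commit message as a quoted string with literal \n escapes.
sample='"Docs: Fix typo.\n\nProps alice.\nFixes #123."'

# One sed invocation, three expressions:
#   s/^.//     drop the first character (the opening quote)
#   s/.$//     drop the last character (the closing quote)
#   s/\\n/ /g  turn each literal backslash-n into a space
cleaned=$(printf '%s\n' "$sample" | sed -e 's/^.//' -e 's/.$//' -e 's/\\n/ /g')

printf '%s\n' "$cleaned"
```

The blank line in the sample message becomes two consecutive spaces in the result, which matters once spaces are turned into line breaks.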

 tr ' \t' '\n'

tr is a powerful program for transforming text. In this case, I am taking spaces and tabs and turning them into actual newlines (rather than the escaped \n sequences the messages started with). There is likely a more elegant way to solve this, but my goal isn’t the best solution, it’s a working one.
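Here is a small sketch of that step on made-up input. The second set is shorter than the first; GNU and BSD tr pad it by repeating its last character, so both the space and the tab map to a newline.

```shell
# Hypothetical input: three words separated by a space and a tab.
words=$(printf 'Fix the\tbug\n' | tr ' \t' '\n')

# Each word now sits on its own line.
printf '%s\n' "$words"
```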

tr '[:upper:]' '[:lower:]'

Word and word are not equal, so we need to make everything a single case. In this case, I am again using tr, but now I am transforming upper case characters to lowercase.
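To see why the case folding matters for counting, this sketch compares the number of distinct lines before and after lowercasing. The three input words are made up.

```shell
# Made-up input: the same word in three different cases.
distinct_before=$(printf 'Fix\nfix\nFIX\n' | sort | uniq | wc -l)
distinct_after=$(printf 'Fix\nfix\nFIX\n' | tr '[:upper:]' '[:lower:]' | sort | uniq | wc -l)

# Without lowercasing, each spelling counts as its own word.
echo "distinct before: $distinct_before, after: $distinct_after"
```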

sort | uniq -c | sort -nr | head -n 25

Counting things on the command line is something I have done so many times that I have an alias for a version of this. sort puts everything in alphabetical order; uniq -c then collapses identical lines and prints each unique value along with how many times it appeared. uniq only compares adjacent lines, hence the initial sort. Next up, we want to sort based on the count, with high numbers first. Finally, we output the top 25.
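A counting helper along those lines might look like the sketch below. The name top_counts and its optional argument are hypothetical, not the actual alias mentioned above.

```shell
# Hypothetical helper (not the author's actual alias): print the N most
# frequent lines read from stdin, most frequent first. N defaults to 25.
top_counts() {
    sort | uniq -c | sort -nr | head -n "${1:-25}"
}

# Made-up word list, one word per line; keep the top two.
result=$(printf 'fix\nadd\nfix\nremove\nfix\nadd\n' | top_counts 2)
printf '%s\n' "$result"
```

Each output line is a count followed by the word, padded with leading spaces by uniq -c. Run against the real commit log, the pipeline above produced this: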

 28997 the
 20429 fixes
 17844 to
 17818 props
 15251 for
 15189 in
 14441 see
 10856 and
 10272 a
 7549 of
 5594 is
 5227 when
 5133 add
 4444 from
 4143 fix
 3847 *
 3821 on
 3489 use
 3320 that
 3267 this
 3064 with
 3043 remove
 2983 be
 2766 as 

That’s not super helpful. “The” isn’t my idea of interesting. So I guess I need to remove useless words. Since I have groff on this machine, I can use its list of common words together with fgrep.

 fgrep -v -w -f /usr/share/groff/1.19.2/eign
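The eign path above includes the groff version, so it will differ from machine to machine. To try the filter without groff, here is a sketch that uses a tiny, hypothetical stop-word file instead. It uses grep -F, which is the equivalent modern spelling of fgrep.

```shell
# Hypothetical stop-word list standing in for groff's eign file.
stopwords=$(mktemp)
printf 'the\nto\nfor\nand\n' > "$stopwords"

# -F: fixed strings (what fgrep does)   -v: print lines that do NOT match
# -w: match whole words only            -f: read patterns from a file
kept=$(printf 'the\nfixes\nto\nprops\nthem\n' | grep -F -v -w -f "$stopwords")
printf '%s\n' "$kept"

rm -f "$stopwords"
```

Because of -w, "them" survives even though it contains "the".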

I also noticed that the second most common word is whitespace. Remember when we used to put two spaces between sentences? WordPress Core commit messages remember. So let’s add another sed command to the chain:

sed '/^$/d'
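The sketch below reproduces the empty "word" on a made-up sentence with two spaces after the period, then removes it two ways. The tr -s variant is a swapped-in alternative, not what the original pipeline uses.

```shell
# Two spaces after "typo." leave an empty line once spaces become newlines.
split=$(printf 'Fix typo.  Props alice.\n' | tr ' \t' '\n')
empty_before=$(printf '%s\n' "$split" | grep -c '^$')

# Option 1: delete empty lines afterwards, as in the post.
with_sed=$(printf '%s\n' "$split" | sed '/^$/d')

# Option 2 (alternative): -s squeezes repeated newlines while translating.
with_squeeze=$(printf 'Fix typo.  Props alice.\n' | tr -s ' \t' '\n')

printf '%s\n' "$with_sed"
```

One caveat with tr -s: if a message starts with whitespace, a single leading empty line still gets through, so the sed filter is the more thorough of the two.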

And now the final command to see the 25 most used words in WordPress Core Commit messages:

svn log -rHEAD:1 -v --xml | xq '.log.logentry | .[].msg' | sed 's/.$//' | sed 's/^.//' | sed 's/\\n/ /g' | tr '[:upper:]' '[:lower:]' | tr ' \t' '\n' | fgrep -v -w -f /usr/share/groff/1.19.2/eign | sed '/^$/d' | sort | uniq -c | sort -nr | head -n 25
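For readability, the text-processing half of that one-liner can be wrapped in a function and fed sample data on stdin. This is a sketch under assumptions: the name top_words, its two arguments (stop-word file, then how many results to show), and the sample messages are all hypothetical.

```shell
# Hypothetical wrapper around the cleanup and counting steps.
# stdin: quoted messages with literal \n escapes, one per line
# $1: stop-word file    $2: number of results (default 25)
top_words() {
    sed -e 's/^.//' -e 's/.$//' -e 's/\\n/ /g' |   # strip quotes, unescape \n
        tr '[:upper:]' '[:lower:]' |               # fold case
        tr ' \t' '\n' |                            # one word per line
        grep -F -v -w -f "$1" |                    # drop stop words
        sed '/^$/d' |                              # drop empty lines
        sort | uniq -c | sort -nr | head -n "${2:-25}"
}

# Made-up stop words and commit messages.
stop=$(mktemp)
printf 'the\na\n' > "$stop"
top=$(printf '%s\n' '"Fix the bug.\n\nFixes #1."' '"Add a test.  Fixes #2."' | top_words "$stop" 1)
printf '%s\n' "$top"
rm -f "$stop"
```

With the function defined, the real run would be svn log -rHEAD:1 -v --xml | xq '.log.logentry | .[].msg' | top_words /usr/share/groff/1.19.2/eign.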

And since you’ve made it this far, here is the list:

 20429 fixes
 17818 props
 15189 in
 5594 is
 5133 add
 4143 fix
 3847 *
 3320 that
 3267 this
 3064 with
 3043 remove
 2766 as
 2435 an
 2432 it
 2109 post
 2103 if
 2080 are
 1889 don't
 1793 update
 1735 -
 1688 twenty
 1523 more
 1500 make
 1471 docs:
 1416 some 

Have an idea for another way to do this with the command line? I would love to hear it.




