Thursday, June 5, 2008

Enumerating Files In The Linux or Unix Shell - More Improvements

Hey There,

Today, we're going to take a little more time and devote it to the folks on all the boards and networking sites who've made excellent suggestions for improvement on some of the shell/Perl scripts and other posts we've put out here over time. As was mentioned the first time we decided to start posting these suggestions and improvements, rather than update old posts that never get any attention, we'll be putting the good stuff out here with updated timestamps so that anyone who reads this blog doesn't ever have to wonder if an issue they find has been addressed. And, also, of course, as a thanks for some great tips for refinement.

Today we're going to look at a couple of improvements and refinements to the shell one-liner to enumerate file types, suggested by folks via email and on the boards over at linuxtoday.com, LXer.com, fsdaily.com, and many other venues.

Again, since our policy regarding privacy is to regard everyone's privacy as equal and well deserved, we will only be referring to the folks who contributed by the nicknames and/or screen-names they used in "talkbacks" which are already posted on the internet. And, then, only if it's relevant and unavoidable.

The suggestions for changing this one-liner were excellent and numerous, and most of them were briefer than the original. They also indirectly pointed out the fact that, when I wrote it, I was obviously obsessing over awk ;) The original one-liner was this:

find . -print|xargs file|awk '{$1="";x[$0]++;}END{for(y in x)printf("%d\t%s\n",x[y],y);}'|sort -nr

The first objection was that file names with spaces in them wouldn't work. This is absolutely true, and can be countered using a variation on "xargs" in the command line, like this:

find . -print|xargs -Ivar file "var"|awk '{$1="";x[$0]++;}END{for(y in x)printf("%d\t%s\n",x[y],y);}'|sort -nr

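This effectively "double quotes" each name: with -I, xargs treats every input line as a single argument, so embedded spaces no longer split it (at the cost of running file once per name). But, while I was thinking about that, I realized that xargs still treats single quotes/apostrophes, double quotes and backslashes as special characters, so there would be additional work you'd need to do to keep them from screwing up the command chain, as well. It was beginning to seem more and more like a solution that could definitely use some re-tooling.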
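Here's a minimal sketch of that behavior, using made-up file names and printf standing in for file so that nothing on disk is needed. It assumes an xargs that follows the usual GNU/BSD quoting rules; other implementations may behave differently.

```shell
# A blank in a name: by default xargs splits it into two arguments.
split_count=$(printf '%s\n' 'my file.txt' | xargs printf '[%s]\n' | wc -l | tr -d ' ')
echo "arguments seen without -I: $split_count"

# With -I, each input line is handed over as one argument.
line_count=$(printf '%s\n' 'my file.txt' | xargs -Ivar printf '[%s]\n' var | wc -l | tr -d ' ')
echo "arguments seen with -I: $line_count"

# An apostrophe still looks like an unmatched quote to xargs, even with -I.
if printf '%s\n' "it's mine.txt" | xargs -Ivar printf '[%s]\n' var >/dev/null 2>&1; then
  quote_ok=yes
else
  quote_ok=no
fi
echo "apostrophe survived: $quote_ok"
```

The variable names (split_count, line_count, quote_ok) are just for illustration. The point is that -I cures the space problem but not the quote problem.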

So, naturally, one suggestion I received was to do it without using xargs. Good deal: One less hassle, to my way of thinking, and removing a whole lot of issues that didn't have to exist. The difference here is the use of the -exec flag with the find command, rather than piping to xargs:

find . -exec file {} \;|awk '{$1="";x[$0]++;}END{for(y in x)printf("%d\t%s\n",x[y],y);}'|sort -nr
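One thing worth checking with any of the awk versions: blanking $1 only removes the text up to the first space, so a name containing a space leaves part of itself behind in the "type". The sketch below feeds the same awk program two made-up lines in the style of file's output, then tries one possible workaround (stripping everything up to the first colon, which is my assumption and not one of the suggestions received; it would still trip over names containing a colon).

```shell
# Simulated `file` output; the names and types are invented for the demo.
sample='./notes.txt: ASCII text
./my notes.txt: ASCII text'

# The awk program from the one-liner: blank out the first field, then count.
by_field=$(printf '%s\n' "$sample" |
  awk '{$1="";x[$0]++;}END{for(y in x)printf("%d\t%s\n",x[y],y);}' | sort -nr)
printf 'blanking $1:\n%s\n' "$by_field"

# Hypothetical variation: strip everything up to the first colon instead.
by_colon=$(printf '%s\n' "$sample" |
  awk '{sub(/^[^:]*:[ \t]*/,"");x[$0]++}END{for(y in x)printf("%d\t%s\n",x[y],y)}' | sort -nr)
printf 'stripping up to the colon:\n%s\n' "$by_colon"
```

The first version reports two different "types" for what is really the same one, while the second groups them together.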

The next suggestion I received was to do it without using awk, but keeping xargs. The -b ("brief") flag tells file to leave the file name off of each output line, so there's nothing left for awk to strip, and sort and uniq -c can do the counting. Removing awk can also make things a lot less confusing for most folks (myself included, which is probably why I used it originally: not out of a twisted desire to cause myself grief, but to try and get more comfortable with it ;) That suggestion looked like this:

find . -print | xargs file -b | sort | uniq -c | sort -nr
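To see what the sort | uniq -c | sort -nr tail is doing on its own, here's a small sketch with a made-up list of type descriptions standing in for the output of file -b:

```shell
# Invented type descriptions, one per "file", in no particular order.
types='ASCII text
directory
ASCII text
empty
ASCII text
directory'

# sort groups identical lines together, uniq -c collapses each group into
# "count line", and sort -nr puts the biggest counts first.
tally=$(printf '%s\n' "$types" | sort | uniq -c | sort -nr)
printf '%s\n' "$tally"
```

uniq only merges lines that are next to each other, which is why the first sort has to come before it.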

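But, this took us back to the xargs quoting and space-in-filename issue. The fix is to have find separate the names with a NUL character (-print0) and have xargs split on exactly that and nothing else (-0), so spaces, quotes and even newlines in names pass through untouched. The -r flag (a GNU xargs extension) keeps xargs from running file at all if find comes up empty. This is the final suggestion that came from that community debate, which I think is probably the best (as in most succinct and utile) since it does it without awk, addresses the issues with xargs and can handle all the issues raised above: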

find . -print0 | xargs -0r file -b | sort | uniq -c | sort -nr
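If you'd like to try it against some deliberately nasty names, here's a rough sketch. The scratch directory, the file names and the count_types function are all made up for the demo, and it assumes mktemp, file and an xargs that understands -0 and -r are available.

```shell
# Scratch directory holding empty files with awkward names.
demo_dir=$(mktemp -d)
: > "$demo_dir/plain"
: > "$demo_dir/with space"
: > "$demo_dir/it's quoted"

# The NUL-delimited pipeline from above, wrapped so it takes a directory.
count_types() {
  find "$1" -print0 | xargs -0r file -b | sort | uniq -c | sort -nr
}

result=$(count_types "$demo_dir")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

All three files should end up counted under the same type, along with one entry for the directory itself, and without any complaints from xargs about quotes.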

If you want to check out this interaction, to gain some more insight into the thought behind each version, you can find it here on linuxtoday.com (at least for a while, assuming it will get moved eventually).

And, once again, a huge "Thank you" to anyone and everyone whose helpful criticism proved that, not only is there more than one way to skin a cat, there are far more efficient ways to do it than you or I may imagine in one sitting (but, please, don't skin any cats ;)

There's probably someone out there who knows a way to do it even better. Which is, of course, the beauty of Linux and Unix and why I enjoy working in the shell. It can be as simple or as complicated as you need it to be, and is flexible enough to allow users the creativity to determine the path and the outcome of virtually everything that can be accomplished using either OS (or both :)

Have a great morning/day/afternoon/evening,

, Mike