Thursday, May 1, 2008

Using OD To Find Bad Characters In Files On Linux Or Unix

Hey there,

Today we're going to take a look at a command called "od" (short for "octal dump") and the sometimes elusive search for, and destruction of, those "mystery characters" that make your scripts not run correctly. Sure, there are programs out there like dos2unix that pretty much take care of this problem for you, but that's like taking candy from a baby ;) And there's always the possibility that they won't be able to find the bizarre character you're trying to get rid of. What do you do then?

We covered the other side of this issue in an earlier post on adding real control characters to your files. Now, it's time to finally undo that damage ;)

od is a great utility for finding out whether there really are bad characters in your file in the first place. "cat -v" works, too, but you can only trust it as much as you trust your terminal settings. I generally assume my tty settings are correct, but they could be up to just about anything ;) For today's purposes, "cat -v" serves as an excellent companion to od's output and, hopefully, makes it even easier to understand.

For instance, if you're looking for a control character in a file (we'll say ^M, since that one's the most popular), cat -v does a great job of bringing it out in the open, like this:

host # cat file
def
host # cat -v file
abc^Mdef


od can do the same thing, if only slightly differently:

host # od -c file
0000000 a b c \r d e f \n
0000010
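If you'd like to follow along without digging up a genuinely broken file, here's a minimal sketch that builds one. The scratch file from mktemp and the variable names are my own stand-ins, and it assumes your cat supports "-v" (GNU and BSD versions do):

```shell
# Build a throwaway sample with a literal carriage return (^M) in the middle.
# "sample" is a hypothetical scratch file, not a name from the post.
sample=$(mktemp)
printf 'abc\rdef\n' > "$sample"

# Capture both views of the same bytes.
visible=$(cat -v "$sample")   # caret notation, e.g. ^M
dump=$(od -c "$sample")       # backslash escapes, e.g. \r

printf '%s\n' "$visible"
printf '%s\n' "$dump"

rm -f "$sample"
```

Both commands are looking at exactly the same bytes; they just spell the carriage return differently.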


Notice that, with the "-c" flag, the "^M" shows up as an easily identifiable "\r". We also get to see what having that "actual" ^M in our file has done. Look at the second line of the "od" output (0000010). The first column of od's output is the byte offset of that line, written in octal. The first line naturally starts at an offset of 0, and the last line, which has no data on it, shows the offset just past the end of the file; in other words, the file's total size in bytes. For 6 letters and a new-line, that number should be 7. Instead we get octal 10, which is 8 in decimal: one byte more than we can see in the plain "cat" output. Another clue that something is amiss.
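If octal offsets make your head hurt, here's a small sketch of two ways around them. It assumes a POSIX-style shell (where a leading 0 in arithmetic means octal) and an od that accepts the POSIX "-A d" option for decimal offsets; the variable names are just for illustration:

```shell
# Shell arithmetic reads a leading 0 as octal, so this converts od's
# closing offset (0000010) into a decimal byte count.
decimal=$((010))
echo "$decimal"

# Or ask od to print its offsets in decimal in the first place.
last=$(printf 'abc\rdef\n' | od -A d -c | tail -n 1)
echo "$last"
```

Either way, the answer should come out as 8 bytes, one more than the 7 you'd expect from "abcdef" plus a new-line.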

This is what the file would look like normally, using the same example as above, but without the ^M inserted:

host # cat file
abcdef
host # cat -v file
abcdef
host # od -c file
0000000 a b c d e f \n
0000007


Perfect :) The really nice thing about using od with the "-c" flag is that it prints out all regular characters in ASCII, which, studies have shown, most humans prefer reading ;) Only when od runs into a non-printing character does it do any substitution, as with the ^M. Below is the basic table of conversions the "-c" flag will do automatically:

null \0
backspace \b
form-feed \f
new-line \n
return \r
tab \t
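To see those conversions for yourself, here's a quick sketch that feeds each of the six characters through od. The "-A n" option (which drops the offset column) is my addition, just to keep the output short:

```shell
# Emit NUL, backspace, form-feed, new-line, return and tab, in that order,
# and let od -c spell each one out as an escape.
escapes=$(printf '\0\b\f\n\r\t' | od -A n -c)
printf '%s\n' "$escapes"
```

You should get the six escapes back in the same order as the table.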


After that, if od hits a byte that is neither a printable ASCII character nor one of the escapes above, it will finally live up to its name and dump the octal value (of course, we could have forced it to do this all along, but we're trying to make this command seem accessible ;). Here's an example of what we see when od can't translate for us ASCII folks anymore:

host # cat file
bcdef
host # cat -v file
a^?bcdef
host # od -c file
0000000 a 177 b c d e f \n
0000010
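For the curious, this is roughly what "forcing it all along" could look like. It's only a sketch, assuming an od that supports the POSIX "-t o1" type (one octal value per byte) and "-A n" (no offset column); the sample bytes are recreated with printf rather than read from a real file:

```shell
# Dump every byte as a 3-digit octal value, printable or not.
octal=$(printf 'a\177bcdef\n' | od -A n -t o1)
printf '%s\n' "$octal"
# 141 is "a", 177 is the mystery character, 012 is the new-line.
```

It's less friendly to read than "-c", but it hands you the octal code for every byte without any guesswork.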


Okay. So, in this instance, "cat -v" shows us the "delete" character (^?), which is often what the tty's "erase" setting is bound to. od shows the same byte as its octal value, 177. The closing offset is 10 again, and remember that it's octal: 8 bytes in all, made up of the 6 letters, the new-line and the single byte that od prints as 177. So everything's in order (even though it's not right).

The really sweet thing is that you, armed with the knowledge od has provided, can remove the character from the file with tr (or your fancy file/character manipulation command of choice) using the octal code, which makes it completely unnecessary for you to remember some bizarre symbol ;)

host # cat file
bcdef
host # cat -v file
a^?bcdef
host # od -c file
0000000 a 177 b c d e f \n
0000010
host # tr -d '\177' <file
abcdef
host # tr -d '\177' <file >file2     <--- Where the magic happens...
host # cat file2
abcdef
host # cat -v file2
abcdef
host # od -c file2
0000000 a b c d e f \n
0000007
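The same trick works on the ^M from the first example. Here's a hedged sketch, with mktemp scratch files standing in for "file" and "file2" (the variable names are mine):

```shell
# Hypothetical scratch files standing in for "file" and "file2".
dirty=$(mktemp)
clean=$(mktemp)
printf 'abc\rdef\n' > "$dirty"

# 015 is the octal code for carriage return (od -c shows it as \r).
# Write to a second file: redirecting straight back onto the input
# would truncate it before tr ever gets to read it.
tr -d '\015' < "$dirty" > "$clean"

# Verify: the contents and the byte count should now match a clean file.
result=$(cat "$clean")
size=$(( $(wc -c < "$clean") ))
od -c "$clean"

rm -f "$dirty" "$clean"
```

Once od confirms the cleaned copy is 7 bytes with no "\r" in sight, you can move it over the original.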


Hope this makes the crusty old "od" command seem less daunting. You can do a whole lot more with it, of course, but there will be plenty of time for that later :)

Cheers,

, Mike