Thursday, March 20, 2008

Generating Only Unique Content From Two Somewhat Similar Files

Hey There,

I see this question, in one variation or another, on the message boards from time to time. So, touching back on our earlier post regarding making our own diff to deal with directory permissions, today we're going to look at a script that performs another function that isn't built in to the standard "diff" command on Linux or Unix, but is fairly simple to implement in ksh or bash.

Our script today takes two files (generally text files) as input. They're usually of different sizes, but they can be of equal size. It doesn't seem to make a difference ;) The only restriction on the usage of the script we're presenting today is that, if the files are of unequal size, the smaller of the two should be listed as the first argument to the script (we'll call it "rdiff") and the larger should be the second argument, like so:

host # ./rdiff smallfile largefile

and, of course, if they're of equal size:

host # ./rdiff file1 file2
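
If you don't know ahead of time which file is the smaller one, a little helper can sort that out before rdiff gets called. What follows is only a sketch: the function name order_by_size and the variables Small and Large are made up for this example and aren't part of the rdiff script. It compares byte counts with wc -c and treats equal sizes as "leave the order alone".

```shell
#!/bin/sh
# order_by_size (hypothetical helper): set Small and Large so that Small
# names the file with fewer bytes. Equal sizes keep the given order.
order_by_size() {
	# Arithmetic expansion strips any padding some wc versions print
	size1=$(( $(wc -c <"$1") ))
	size2=$(( $(wc -c <"$2") ))
	if [ "$size1" -le "$size2" ]
	then
		Small=$1
		Large=$2
	else
		Small=$2
		Large=$1
	fi
}

# Usage sketch, assuming rdiff sits in the current directory:
#   order_by_size thisfile thatfile
#   ./rdiff "$Small" "$Large"
```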

The output of the script will be a file named "Unique.out.smallfile.largefile" with "smallfile" and "largefile" being the values of the file names passed to the script on the command line.

Basically, our script uses sed and grep to determine if lines in the smaller file are duplicated in the larger file. If those lines are duplicated, they are removed from the final output, so that your "unique" output file only includes the lines of the larger file that have no duplicate in the smaller file. All duplicate information is removed.
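
To make that concrete, here's a minimal sketch using two made-up sample files. It doesn't call rdiff at all. Instead, it leans on grep's standard -F (fixed strings), -x (whole line), -v (invert) and -f (patterns from a file) options to print the lines of the large file that have no exact duplicate in the small file, which makes for a handy cross-check. The file contents and the expected.out name are assumptions for illustration only.

```shell
#!/bin/sh
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data, invented for this example
printf '%s\n' apple banana >smallfile
printf '%s\n' apple banana cherry "date palm" >largefile

# Print every line of largefile that matches no whole line of smallfile
grep -Fvx -f smallfile largefile >expected.out

cat expected.out
# prints:
# cherry
# date palm
```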

Enjoy and cheers,


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

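#!/bin/ksh

#
# rdiff - find unique content in two files
#
# 2008 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

if [ $# -ne 2 ]
then
	echo "Usage: $0 SmallFile LargeFile"
	exit 1
fi

FileA=$1
FileB=$2
OutFile="Unique.out.${FileA}.${FileB}"

# Keep a backup of the large file, then start the output as a full copy of it
cp "$FileB" "${FileB}.old"
cp "$FileB" "$OutFile"
: >tmpfile

# Pass 1: find every line of the small file that also appears, as a
# whole line, in the large file. read -r keeps each line intact,
# spaces and all.
while IFS= read -r x
do
	if grep -Fx -e "$x" "$FileB" >/dev/null 2>&1
	then
		# Escape regular expression characters so sed matches literally
		printf '%s\n' "$x" | sed 's|[][\.*^$/]|\\&|g' >>tmpfile
	fi
done <"$FileA"

# Pass 2: delete each duplicated line from the output file, writing the
# result back every time so that the deletions accumulate
while IFS= read -r x
do
	sed "/^${x}\$/d" "$OutFile" >newtmpfile
	mv newtmpfile "$OutFile"
done <tmpfile

rm tmpfile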


, Mike