Friday, January 4, 2008

Getting Rid Of HTML Tags And Leaving The Comments

Hey there,

If you've ever found yourself stuck in a spot where you needed to convert an HTML page to straight-up text (for whatever reason), today's script might be a nice addition to your toolkit. It's been tested in ash and bash on Linux, and in sh, ksh and bash on Solaris.

The main reason I wrote this script was to break down HTML pages (at the code level) by removing all the HTML tags while leaving all the comments. When I can, I prefer to do this in the shell rather than cutting and pasting from Firefox or Internet Explorer, since the cut-and-paste method has a nasty habit of removing all formatting. Well, that's not entirely true. The carriage returns do usually manage to make it through unscathed ;)

This shell script is basically a wrapper for a few sed commands. While most readers are probably somewhat familiar with sed and how to use it, I find that a lot of folks on the forums ask for help with some of its more advanced functionality, like matching patterns across multiple lines. This script matches (and removes) any tag that begins with the "<" character and ends with the ">" character. I use it for HTML, but it could just as easily be used (without any porting at all) on any markup language file that uses the same tagging convention.
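To make the single-line case concrete, here is a minimal sketch of that match on its own. The function name strip_inline_tags is made up for this illustration (it isn't part of htstrip); the sed expression is the same one the script uses for tags that open and close on one line.

```shell
#!/bin/sh
# strip_inline_tags is a hypothetical helper name used only for this sketch.
# It removes every "<...>" that opens and closes on the same line:
#   <        a literal opening bracket
#   [^>]*    any run of characters that are not ">"
#   >        the closing bracket
strip_inline_tags() {
    sed -e 's/<[^>]*>//g'
}

printf '%s\n' '<p>Hello <b>world</b></p>' '<!-- open' 'ended -->' | strip_inline_tags
# → Hello world
# → <!-- open
# → ended -->
```

Note that the last two lines pass through untouched: neither one contains both a "<" and a later ">" on the same line, which is exactly why multi-line tags need separate handling.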

For clarity, I split up the single-line match from the multi-line match. You'll note that the second invocation of sed is where we write our expression to work on an HTML tag that spans more than one line. We'll go into that more in a future post, but for now, note that we basically find every opening "<" character that doesn't have a closing ">" character on the same line, and consume everything on every following line all the way up to, and including, the closing ">" character. The simplest solution, once we've determined that we're going to traverse multiple lines, is to pull those lines in and replace the newlines ("\n") between them with spaces. So, basically, when we do the multi-line match, we're converting that multi-line entry into a single line so that the match can be made. This subject is a little too convoluted to go into in much detail for the purposes of this post ;)
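The join-then-match idea can be sketched roughly like this. It is one common way to do it with sed's N command and a loop, not necessarily the exact expression in the script, and join_open_tags is a made-up name for the illustration. This version also makes no attempt to treat comments differently from tags.

```shell
#!/bin/sh
# join_open_tags is a hypothetical name used only for this sketch.
# While the pattern space holds a "<" with no ">" after it, append the
# next input line (N) and turn the embedded newline into a space, then
# loop back (ta) and check again. "$!N" skips N on the last line so an
# unterminated "<" at end of file doesn't loop forever.
join_open_tags() {
    sed -e ':a
/<[^>]*$/{
$!N
s/\n/ /
ta
}'
}

# A tag split across three lines becomes one line...
printf '%s\n' '<a' 'href="x"' '>link</a>' | join_open_tags
# → <a href="x" >link</a>

# ...which the ordinary single-line expression can then remove.
printf '%s\n' '<a' 'href="x"' '>link</a>' | join_open_tags | sed -e 's/<[^>]*>//g'
# → link
```

The label, the opening brace and the closing brace each sit on their own line on purpose; older sed implementations are picky about that.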

I've called this script htstrip (call it whatever you like) and you can invoke it like this:

host # ./htstrip filea fileb filec <-- Doesn't matter what the file names are; they don't have to be .htm files, etc.

And, as an example, this is the type of output you could expect to see:

host # cat filea.htm
<a> bob1 </a>
<a> bob2
<a> bob3
<!-- this is
a comment
<a> bob5
host # ./htstrip filea.htm
host # cat filea.htm <-- Note that our original file is saved as filea.htm.old, just in case
 bob1
 bob2
 bob3
<!-- this is
a comment
 bob5

Note that, as I mentioned in the title, I've purposely made it so that standard HTML comments will remain. If you don't want these in your output either, a small modification to the script can make sure those go away too. Note also that we use shorthand in the "for x" loop. This post has nothing to do with that, but it's nice to know that you don't need to write "for x in BLAH" if you're looping over the default input (the script's command-line arguments). Technically, it's a better idea to use the full form, especially if you're writing a huge script or are dealing with issues of scope, etc. ...For another day.
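For anyone who hasn't run into the "for x" shorthand before, here's a tiny sketch of it in isolation. The function name show_args and the file names are just placeholders for the example. With no "in" list, the loop walks the positional parameters, the same as writing for x in "$@".

```shell
#!/bin/sh
# show_args is a hypothetical name used only for this sketch.
# "for x" with no "in ..." list iterates over the positional parameters,
# exactly as if it had been written: for x in "$@"
show_args() {
    for x
    do
        printf '[%s]\n' "$x"
    done
}

show_args "file a.htm" fileb
# → [file a.htm]
# → [fileb]
```

An argument containing a space stays in one piece, which is one good reason to quote "$x" wherever it's used inside the loop.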

Hope this helps you out :)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License


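#!/bin/sh

# htstrip - 2008 - Mike Golvach -
# Usage: htstrip filea fileb filen
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

# Clean up temp files on HUP, INT, QUIT or TERM (KILL, signal 9, can't be trapped)
trap 'rm -f temp temp2 temp3;exit 1' 1 2 3 15

if [ $# -lt 1 ]
then
	echo "htstrip needs to know what files you want to strip!"
	exit 1
fi

for x
do
	# Pass 1: strip tags that open and close on one line; "<!..." is left alone
	sed -e 's/<[^!>][^>]*>//g' "$x" > temp
	# Pass 2: for a "<" with no ">" on its line, pull in following lines,
	# join them with spaces, then strip the completed tag. The N/ta loop is
	# a reconstruction, assuming anything starting with "<!" should be kept.
	sed -e ':a
/<[^!>][^>]*$/{
$!N
s/\n/ /
ta
}
s/<[^!>][^>]*>//g' temp > temp2
	# Pass 3: delete lines that are now empty or only spaces
	sed -e '/^ *$/d' temp2 > temp3
	rm -f temp temp2
	mv "$x" "${x}.old"
	mv temp3 "$x"
done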

, Mike