SYNOPSIS
htsplit
[
DESCRIPTION
The htsplit command is used as a filter process to split an HTML file into tokens, which can then be further processed by other shell scripts.
htsplit reads input from the file arguments on the command line; if there are none, it reads from the standard input. By default, the output produced by htsplit is functionally identical to the input, that is, a browser displays the input and the output in the same way.
The following rules are used to split the HTML input into tokens:
- At any point where the input can be split between words without affecting the HTML, a newline is added. This usually means splitting between words, but in certain cases htsplit cannot split between an HTML tag and a word, because adding a newline there would introduce new whitespace.
- All HTML tag names and attributes are converted to uppercase, while attribute values retain their original case.
- All comments are retained.
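The rules above can be approximated in a few lines of Python. This is only an illustrative model, not the htsplit implementation: the names MARKUP, TAG, ATTR, normalize_tag, and split_tokens are hypothetical, and the sketch assumes that a safe split point is any run of whitespace that already exists in text outside of tags and comments.

```python
import re

# A comment, or a tag whose quoted attribute values may contain ">".
MARKUP = re.compile(r"""(<!--.*?-->|<(?:[^>"']|"[^"]*"|'[^']*')*>)""", re.S)
# An opening or closing tag: optional slash, tag name, then the rest.
TAG = re.compile(r"""<(/?)([A-Za-z][A-Za-z0-9]*)(.*)>""", re.S)
# An attribute name, optionally followed by a quoted or bare value.
ATTR = re.compile(
    r"""([A-Za-z_:][-A-Za-z0-9_:.]*)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?"""
)


def normalize_tag(piece):
    """Uppercase the tag name and attribute names; keep values as written."""
    match = TAG.fullmatch(piece)
    if match is None:
        # Comments and declarations such as <!DOCTYPE ...> are retained as is.
        return piece
    slash, name, rest = match.groups()
    rest = ATTR.sub(lambda a: a.group(1).upper() + (a.group(2) or ""), rest)
    return "<" + slash + name.upper() + rest + ">"


def split_tokens(html):
    """Model the default splitting rules on a string of HTML."""
    out = []
    # re.split with one capturing group alternates text, markup, text, ...
    for index, piece in enumerate(MARKUP.split(html)):
        if index % 2 == 1:
            out.append(normalize_tag(piece))
        else:
            # Existing whitespace becomes a newline; no new whitespace is added.
            out.append(re.sub(r"\s+", "\n", piece))
    return "".join(out)


print(split_tokens('<p class="Intro">Hello big world</p><!-- Note -->'))
```

In this model no newline is inserted between the opening tag and the first word, because the input has no whitespace there and adding one would change how a browser renders the text.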
Options
-b
breaks up all tokens even when adding a newline produces output that is no longer functionally identical to the input.
-c
removes comments (all tags beginning with <!--). This also causes all scripting information in the HTML file to be lost.
-e convert|separate
specifies how to handle entity references.
If set to separate, entity references are pulled out and displayed on a separate line. For example, f&aelig;ries is displayed as:
f
&aelig;
ries
If set to convert, entity references are converted to ASCII representations. For example, &amp; is displayed as &.
To perform conversions, htsplit uses a list of entity definitions in $ROOTDIR/etc/entities (see FILES) and other entity definition files specified with the -f option. If an entity is undefined, it is converted to white space. This results in the text before and after the entity appearing on separate lines. For example, if aelig is undefined, f&aelig;ries is displayed as:
f
ries
You can specify additional entity files to use for conversions with the -f option.
-f entfile
specifies an additional entity definition file to be used with -e convert. This file is in the same format as $ROOTDIR/etc/entities (see FILES). When an entity is defined in both the additional file and in $ROOTDIR/etc/entities, htsplit uses the definition from the additional file.
The -f option can appear multiple times to specify multiple additional entity definition files. If an entity is defined in more than one file, the definition from the file specified by the -f option that appears latest on the command line is used.
-o outfile
sends the output of htsplit to outfile rather than the standard output.
-x
processes the input as XML rather than HTML. Tag names are not converted to uppercase and XML directives beginning with <? are recognized and retained.
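The two -e modes described above can be modeled roughly as follows. This Python sketch is an assumption-laden illustration rather than htsplit's actual code: the function names separate and convert and the ENTITY pattern are hypothetical, and the white space produced for an undefined entity is represented directly as a newline, since that is where the output is split.

```python
import re

# A named entity reference such as &aelig; (numeric references are not modeled).
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def separate(text):
    """Model of -e separate: each entity reference goes on its own line."""
    return ENTITY.sub(lambda m: "\n" + m.group(0) + "\n", text)


def convert(text, definitions):
    """Model of -e convert: defined entities are replaced by their ASCII
    representation; undefined ones become a line break."""
    return ENTITY.sub(lambda m: definitions.get(m.group(1), "\n"), text)


print(separate("f&aelig;ries"))
print(convert("AT&amp;T", {"amp": "&"}))
print(convert("f&aelig;ries", {}))
```

The definitions argument stands in for the entity definitions that htsplit loads from $ROOTDIR/etc/entities and from any files given with -f.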
FILES
- $ROOTDIR/etc/entities
contains a list of entity definitions that htsplit uses when converting entities to ASCII representations. Each entry in the list is an entity reference followed by a single space and then the ASCII representation of that entity. Additional spaces are considered to be part of the ASCII representation. For example, in this file, the entity for a non-breaking space (&nbsp;) is defined as:
nbsp<space><space>
That is, it is the entity name followed by two spaces. The first space is the separator and the second space is the ASCII representation.
The ASCII representation of an entity may itself include entities which are also converted.
Lines beginning with the & character are treated as comments.
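As a rough illustration of the file format and of the precedence rule for -f, the following Python sketch parses entity definitions and merges several sets of them. The function names parse_entities and merge_definitions are hypothetical, the sketch assumes that entries name the entity without the surrounding & and ; (as in the nbsp example), and it does not model entities nested inside a representation.

```python
def parse_entities(text):
    """Parse entity definitions: a name, one separator space, then the
    ASCII representation, which may itself contain spaces."""
    definitions = {}
    for line in text.splitlines():
        # Empty lines are skipped; lines beginning with & are comments.
        if not line or line.startswith("&"):
            continue
        name, separator, representation = line.partition(" ")
        if separator:  # ignore lines that have no separator space
            definitions[name] = representation
    return definitions


def merge_definitions(default, extra_files):
    """Combine the default definitions with those from additional files.

    extra_files is in command-line order, so a definition from a later
    file replaces one from an earlier file or from the default list."""
    merged = dict(default)
    for extra in extra_files:
        merged.update(extra)
    return merged


default = parse_entities("& sample comment line\nnbsp  \namp &\n")
merged = merge_definitions(default, [{"aelig": "ae"}, {"aelig": "e"}])
print(repr(default["nbsp"]), merged["aelig"])
```

Splitting only on the first space is what keeps the second space in the nbsp entry as the representation itself.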
DIAGNOSTICS
Possible exit status values are:
AVAILABILITY
PTC MKS Toolkit for System Administrators
PTC MKS Toolkit for Developers
PTC MKS Toolkit for Interoperability
PTC MKS Toolkit for Professional Developers
PTC MKS Toolkit for Professional Developers 64-Bit Edition
PTC MKS Toolkit for Enterprise Developers
PTC MKS Toolkit for Enterprise Developers 64-Bit Edition
SEE ALSO
- Commands:
- htdiff
PTC MKS Toolkit 10.5 Documentation Build 40.