perlpodspeccopy - Plain Old Documentation: format specification and notes


NAME

perlpodspeccopy - Plain Old Documentation: format specification and notes


DISCLAIMER

This is a pod file used for testing purposes by the test suite. For the actual specification, please see the perlpodspec manpage.


DESCRIPTION

This document contains detailed notes on the Pod markup language. Most people will only have to read perlpod to know how to write in Pod, but this document may answer some incidental questions to do with parsing and rendering Pod.

In this document, ``must'' / ``must not'', ``should'' / ``should not'', and ``may'' have their conventional (cf. RFC 2119) meanings: ``X must do Y'' means that if X doesn't do Y, it's against this specification, and should really be fixed. ``X should do Y'' means that it's recommended, but X may fail to do Y, if there's a good reason. ``X may do Y'' is merely a note that X can do Y at will (although it is up to the reader to detect any connotation of ``and I think it would be nice if X did Y'' versus ``it wouldn't really bother me if X did Y'').

Notably, when I say ``the parser should do Y'', the parser may fail to do Y, if the calling application explicitly requests that the parser not do Y. I often phrase this as ``the parser should, by default, do Y.'' This doesn't require the parser to provide an option for turning off whatever feature Y is (like expanding tabs in verbatim paragraphs), although it implies that such an option may be provided.


Pod Definitions

Pod is embedded in files, typically Perl source files -- although you can write a file that's nothing but Pod.

A line in a file consists of zero or more non-newline characters, terminated by either a newline or the end of the file.

A newline sequence is usually a platform-dependent concept, but Pod parsers should understand it to mean any of CR (ASCII 13), LF (ASCII 10), or a CRLF (ASCII 13 followed immediately by ASCII 10), in addition to any other system-specific meaning. The first CR/CRLF/LF sequence in the file may be used as the basis for identifying the newline sequence for parsing the rest of the file.
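As a minimal sketch of the newline rule above, the following Python function splits text on CR, LF, or CRLF. The name split_lines is hypothetical and is not part of any Pod parser. A regular expression is used rather than str.splitlines(), because the latter also splits on further characters (such as form feed) that the definition above does not mention.

```python
import re

# CRLF must be listed before the lone CR and LF alternatives,
# otherwise "\r\n" would be seen as two separate newlines.
_NEWLINE = re.compile(r"\r\n|\r|\n")

def split_lines(text):
    """Split text into lines, treating CR, LF, and CRLF as newlines."""
    lines = _NEWLINE.split(text)
    # A trailing newline terminates the last line rather than starting
    # a new empty one, so drop the empty string it leaves behind.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

This sketch accepts mixed newline styles in one file. A parser could instead fix on the first newline sequence it sees, as the paragraph above permits.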

A blank line is a line consisting entirely of zero or more spaces (ASCII 32) or tabs (ASCII 9), and terminated by a newline or end-of-file. A non-blank line is a line containing one or more characters other than space or tab (and terminated by a newline or end-of-file).

(Note: Many older Pod parsers did not accept a line consisting of spaces/tabs and then a newline as a blank line -- the only lines they considered blank were lines consisting of no characters at all, terminated by a newline.)
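The two notions of a blank line described above can be contrasted in a short Python sketch. Both function names are hypothetical, and each expects a line whose newline has already been removed.

```python
def is_blank(line):
    """Current rule: a line holding only spaces and tabs (or nothing) is blank."""
    return line.strip(" \t") == ""

def is_blank_strict(line):
    """Older rule: only a line with no characters at all is blank."""
    return line == ""
```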

Whitespace is used in this document as a blanket term for spaces, tabs, and newline sequences. (By itself, this term usually refers to literal whitespace. That is, sequences of whitespace characters in Pod source, as opposed to ``E<32>'', which is a formatting code that denotes a whitespace character.)

A Pod parser is a module meant for parsing Pod (regardless of whether this involves calling callbacks or building a parse tree or directly formatting it). A Pod formatter (or Pod translator) is a module or program that converts Pod to some other format (HTML, plaintext, TeX, PostScript, RTF). A Pod processor might be a formatter or translator, or might be a program that does something else with the Pod (like counting words, scanning for index points, etc.).

Pod content is contained in Pod blocks. A Pod block starts with a line that matches m/\A=[a-zA-Z]/, and continues up to the next line that matches m/\A=cut/ -- or up to the end of the file, if there is no m/\A=cut/ line.
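The block rule above can be sketched in Python as follows. This is an illustration only: the function name pod_blocks is hypothetical, the choice to leave the ``=cut'' line out of the returned block is an assumption, and raising an exception for a ``=cut'' seen outside any block is just one way to stop parsing.

```python
import re

POD_START = re.compile(r"=[a-zA-Z]")   # counterpart of m/\A=[a-zA-Z]/
POD_END = re.compile(r"=cut")          # counterpart of m/\A=cut/

def pod_blocks(lines):
    """Return the Pod blocks found in a list of lines, each as a list of lines."""
    blocks, current = [], None
    for line in lines:
        if current is None:
            # Outside a block: a =cut here has nothing to close.
            if POD_END.match(line):
                raise ValueError("=cut found outside of a Pod block")
            if POD_START.match(line):
                current = [line]
        elif POD_END.match(line):
            # Assumption: the =cut line itself is not kept in the block.
            blocks.append(current)
            current = None
        else:
            current.append(line)
    # A block with no =cut runs to the end of the file.
    if current is not None:
        blocks.append(current)
    return blocks
```

Note that re.match anchors at the start of the string, which plays the role of \A here.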

Within a Pod block, there are Pod paragraphs. A Pod paragraph consists of non-blank lines of text, separated by one or more blank lines.

For purposes of Pod processing, there are four types of paragraphs in a Pod block: command paragraphs, verbatim paragraphs, ordinary paragraphs, and data paragraphs.

For example, consider the following paragraphs:

  # <- that's the 0th column

  =head1 Foo

  Stuff

    $foo->bar

  =cut

Here, ``=head1 Foo'' and ``=cut'' are command paragraphs because the first line of each matches m/\A=[a-zA-Z]/. ``[space][space]$foo->bar'' is a verbatim paragraph, because its first line starts with a literal whitespace character (and there's no ``=begin''...``=end'' region around).
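The classification just described can be sketched in Python. The names split_paragraphs and classify are hypothetical, and the sketch assumes that no ``=begin''...``=end'' region is in effect, so data paragraphs never arise.

```python
import re

def split_paragraphs(lines):
    """Group the non-blank lines of a Pod block into paragraphs."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip(" \t") == "":      # blank line: ends the paragraph
            if current:
                paragraphs.append(current)
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(current)
    return paragraphs

def classify(paragraph):
    """Classify a paragraph by its first line only."""
    first = paragraph[0]
    if re.match(r"=[a-zA-Z]", first):
        return "command"
    if first[0] in " \t":
        return "verbatim"
    return "ordinary"
```

Applied to the example above, this yields a command paragraph, an ordinary paragraph (``Stuff''), a verbatim paragraph, and a final command paragraph.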

The ``=begin identifier'' ... ``=end identifier'' commands stop paragraphs that they surround from being parsed as ordinary or verbatim paragraphs, if identifier doesn't begin with a colon. This is discussed in detail in the section About Data Paragraphs and ``=begin/=end'' Regions.


Pod Commands

This section is intended to supplement and clarify the discussion in Command Paragraph in the perlpod manpage. These are the currently recognized Pod commands:

``=head1'', ``=head2'', ``=head3'', ``=head4''
This command indicates that the text in the remainder of the paragraph is a heading. That text may contain formatting codes. Examples:
  =head1 Object Attributes
  =head3 What B<Not> to Do!

``=pod''
This command indicates that this paragraph begins a Pod block. (If we are already in the middle of a Pod block, this command has no effect at all.) If there is any text in this command paragraph after ``=pod'', it must be ignored. Examples:
  =pod
  This is a plain Pod paragraph.
  =pod This text is ignored.

``=cut''
This command indicates that this line is the end of this previously started Pod block. If there is any text after ``=cut'' on the line, it must be ignored. Examples:
  =cut
  =cut The documentation ends here.
  =cut
  # This is the first line of program text.
  sub foo { # This is the second.

It is an error to try to start a Pod block with a ``=cut'' command. In that case, the Pod processor must halt parsing of the input file, and must by default emit a warning.

``=over''
This command indicates that this is the start of a list/indent region. If there is any text following the ``=over'', it must consist of only a nonzero positive numeral. The semantics of this numeral are explained in the About =over...=back Regions section, further below. Formatting codes are not expanded. Examples:
  =over 3
  =over 3.5
  =over

``=item''
This command indicates that an item in a list begins here. Formatting codes are processed. The semantics of the (optional) text in the remainder of this paragraph are explained in the About =over...=back Regions section, further below. Examples:
  =item
  =item *
  =item      *
  =item 14
  =item   3.
  =item C<< $thing->stuff(I<dodad>) >>
  =item For transporting us beyond seas to be tried for pretended
  offenses
  =item He is at this time transporting large armies of foreign
  mercenaries to complete the works of death, desolation and
  tyranny, already begun with circumstances of cruelty and perfidy
  scarcely paralleled in the most barbarous ages, and totally
  unworthy the head of a civilized nation.

``=back''
This command indicates that this is the end of the region begun by the most recent ``=over'' command. It permits no text after the ``=back'' command.

``=begin formatname''
This marks the following paragraphs (until the matching ``=end formatname'') as being for some special kind of processing. Unless ``formatname'' begins with a colon, the contained non-command paragraphs are data paragraphs. But if ``formatname'' does begin with a colon, then non-command paragraphs are ordinary paragraphs or verbatim paragraphs. This is discussed in detail in the section About Data Paragraphs and ``=begin/=end'' Regions.

It is advised that formatnames match the regexp m/\A:?[-a-zA-Z0-9_]+\z/. Implementors should anticipate future expansion in the semantics and syntax of the first parameter to ``=begin''/``=end''/``=for''.

``=end formatname''
This marks the end of the region opened by the matching ``=begin formatname'' command. If ``formatname'' is not the formatname of the most recent open ``=begin formatname'' region, then this is an error, and must generate an error message. This is discussed in detail in the section About Data Paragraphs and ``=begin/=end'' Regions.

``=for formatname text...''
This is synonymous with:
     =begin formatname
     text...
     =end formatname

That is, it creates a region consisting of a single paragraph; that paragraph is to be treated as a normal paragraph if ``formatname'' begins with a ``:''; if ``formatname'' doesn't begin with a colon, then ``text...'' will constitute a data paragraph. There is no way to use ``=for formatname text...'' to express ``text...'' as a verbatim paragraph.

``=encoding encodingname''
This command, which should occur early in the document (at least before any non-US-ASCII data!), declares that this document is encoded in the encoding encodingname, which must be an encoding name that the Encode manpage recognizes. (Encode's list of supported encodings, in the Encode::Supported manpage, is useful here.) If the Pod parser cannot decode the declared encoding, it should emit a warning and may abort parsing the document altogether.

A document having more than one ``=encoding'' line should be considered an error. Pod processors may silently tolerate this if the not-first ``=encoding'' lines are just duplicates of the first one (e.g., if there's a ``=encoding utf8'' line, and later on another ``=encoding utf8'' line). But Pod processors should complain if there are contradictory ``=encoding'' lines in the same document (e.g., if there is a ``=encoding utf8'' early in the document and ``=encoding big5'' later). Pod processors that recognize BOMs may also complain if they see an ``=encoding'' line that contradicts the BOM (e.g., if a document with a UTF-16LE BOM has an ``=encoding shiftjis'' line).
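The handling of repeated ``=encoding'' lines described above might be sketched as follows in Python. The function name is hypothetical, and comparing names case-insensitively is an assumption of this sketch, not something the text above specifies.

```python
def check_encoding_lines(names):
    """Take the =encoding arguments in document order; return complaints."""
    complaints = []
    for later in names[1:]:
        # Exact duplicates of the first declaration are silently tolerated.
        if later.lower() != names[0].lower():
            complaints.append(
                "=encoding %s contradicts earlier =encoding %s" % (later, names[0])
            )
    return complaints
```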

If a Pod processor sees any command other than the ones listed above (like ``=head'', or ``=haed1'', or ``=stuff'', or ``=cuttlefish'', or ``=w123''), that processor must by default treat this as an error. It must not process the paragraph beginning with that command, must by default warn of this as an error, and may abort the parse. A Pod parser may allow a way for particular applications to add to the above list of known commands, and to stipulate, for each additional command, whether formatting codes should be processed.
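A minimal Python sketch of the unknown-command check follows. The names KNOWN_COMMANDS, command_name, and is_known_command are hypothetical, and the extra parameter stands in for whatever mechanism a parser might offer for registering additional commands.

```python
import re

KNOWN_COMMANDS = frozenset({
    "head1", "head2", "head3", "head4", "pod", "cut", "over",
    "item", "back", "begin", "end", "for", "encoding",
})

def command_name(paragraph_text):
    """Return the command word of a command paragraph, or None."""
    m = re.match(r"=([a-zA-Z]\S*)", paragraph_text)
    return m.group(1) if m else None

def is_known_command(paragraph_text, extra=()):
    """True if the paragraph starts with a listed or registered command."""
    name = command_name(paragraph_text)
    return name is not None and (name in KNOWN_COMMANDS or name in extra)
```

The whole command word is compared, so ``=cuttlefish'' is not mistaken for ``=cut''.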

Future versions of this specification may add additional commands.
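The argument rules given earlier in this section for ``=over'' and for formatnames can be sketched in Python as below. Both function names are hypothetical. Returning None for a missing ``=over'' argument is a choice of this sketch, and since the formatname pattern is only advised, a real parser need not reject names that fail it.

```python
import re

# Counterpart of m/\A:?[-a-zA-Z0-9_]+\z/
FORMATNAME = re.compile(r":?[-a-zA-Z0-9_]+")

def parse_over_argument(arg):
    """Return the =over numeral as a float, or None when no argument is given."""
    arg = arg.strip()
    if arg == "":
        return None
    if not re.fullmatch(r"[0-9]+(?:\.[0-9]+)?", arg):
        raise ValueError("=over argument is not a numeral: %r" % arg)
    value = float(arg)
    if value <= 0:
        raise ValueError("=over argument must be positive: %r" % arg)
    return value

def split_formatname(name):
    """Return (identifier, has_colon), or None if the name does not match."""
    if FORMATNAME.fullmatch(name) is None:
        return None
    if name.startswith(":"):
        return (name[1:], True)
    return (name, False)
```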


Pod Formatting Codes

(Note that in previous drafts of this document and of perlpod, formatting codes were referred to as ``interior sequences'', and this term may still be found in the documentation for Pod parsers, and in error messages from Pod processors.)

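There are two syntaxes for formatting codes: a capital letter followed by a single ``<'' and closed by the matching ``>'' (as in ``C<$x>''), and a capital letter followed by two or more ``<'' characters and whitespace, closed by whitespace and the same number of ``>'' characters (as in ``C<< $foo->bar >>'').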

In parsing Pod, a notably tricky part is the correct parsing of (potentially nested!) formatting codes. Implementors should consult the code in the parse_text routine in Pod::Parser as an example of a correct implementation.

I<text> -- italic text
See the brief discussion in Formatting Codes in the perlpod manpage.

B<text> -- bold text
See the brief discussion in Formatting Codes in the perlpod manpage.

C<code> -- code text
See the brief discussion in Formatting Codes in the perlpod manpage.

F<filename> -- style for filenames
See the brief discussion in Formatting Codes in the perlpod manpage.

X<topic name> -- an index entry
See the brief discussion in Formatting Codes in the perlpod manpage.

This code is unusual in that most formatters completely discard this code and its content. Other formatters will render it with invisible codes that can be used in building an index of the current document.

Z<> -- a null (zero-effect) formatting code
Discussed briefly in Formatting Codes in the perlpod manpage.

This code is unusual in that it should have no content. That is, a processor may complain if it sees Z<potatoes>. Whether or not it complains, the potatoes text should be ignored.

L<name> -- a hyperlink
The complicated syntaxes of this code are discussed at length in Formatting Codes in the perlpod manpage, and implementation details are discussed below, in About L<...> Codes. Parsing the contents of L<content> is tricky. Notably, the content has to be checked for whether it looks like a URL, or whether it has to be split on literal ``|'' and/or ``/'' (in the right order!), and so on, before E<...> codes are resolved.

E<escape> -- a character escape
See Formatting Codes in the perlpod manpage, and several points in Notes on Implementing Pod Processors.

S<text> -- text contains non-breaking spaces
This formatting code is syntactically simple, but semantically complex. What it means is that each space in the printable content of this code signifies a non-breaking space.

Consider:

    C<$x ? $y    :  $z>
    S<C<$x ? $y     :  $z>>

Both signify the monospace (c[ode] style) text consisting of ``$x'', one space, ``?'', one space, ``$y'', one space, ``:'', one space, ``$z''. The difference is that in the latter, with the S code, those spaces are not ``normal'' spaces, but instead are non-breaking spaces.
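As a rough Python illustration of the S code's effect, the sketch below takes the already-extracted text content of an S code, collapses each run of whitespace to one space (as the example above implies), and then turns every space into a non-breaking space. The function name is hypothetical, and a real formatter would apply this to the printable content after any nested codes have been processed.

```python
import re

NBSP = "\u00a0"   # U+00A0 NO-BREAK SPACE

def render_s_content(text):
    """Collapse whitespace runs, then make every space non-breaking."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    return collapsed.replace(" ", NBSP)
```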

If a Pod processor sees any formatting code other than the ones listed above (as in ``N<...>'', or ``Q<...>'', etc.), that processor must by default treat this as an error. A Pod parser may allow a way for particular applications to add to the above list of known formatting codes; a Pod parser might even allow a way to stipulate, for each additional formatting code, whether it requires some form of special processing, as L<...> does.

Future versions of this specification may add additional formatting codes.

Historical note: A few older Pod processors would not see a ``>'' as closing a ``C<'' code, if the ``>'' was immediately preceded by a ``-''. This was so that this:

    C<$foo->bar>

would parse as equivalent to this:

    C<$foo-E<gt>bar>

instead of as equivalent to a ``C'' formatting code containing only ``$foo-'', and then a ``bar>'' outside the ``C'' formatting code. This problem has since been solved by the addition of syntaxes like this:

    C<< $foo->bar >>

Compliant parsers must not treat ``->'' as special.

Formatting codes absolutely cannot span paragraphs. If a code is opened in one paragraph, and no closing code is found by the end of that paragraph, the Pod parser must close that formatting code, and should complain (as in ``Unterminated I code in the paragraph starting at line 123: 'Time objects are not...'''). So these two paragraphs:

  I<I told you not to do this!

  Don't make me say it again!>

...must not be parsed as two paragraphs in italics (with the I code starting in one paragraph and ending in another). Instead, the first paragraph should generate a warning, but that aside, the above code must parse as if it were:

  I<I told you not to do this!>

  Don't make me say it again!E<gt>

(In SGMLish jargon, all Pod commands are like block-level elements, whereas all Pod formatting codes are like inline-level elements.)
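The parsing rules discussed in this section (two bracket syntaxes, nesting, no special treatment of ``->'', and closing any code still open at the end of its paragraph) can be drawn together in a small Python sketch. This is not the parse_text routine from Pod::Parser, and it is not a complete implementation: the name parse_codes and the tree shape are hypothetical, and the special handling that L<...> and E<...> need is left out.

```python
import re

# A capital letter, then either "<" or two or more "<" plus whitespace.
OPEN = r"(?P<open>[A-Z](?P<brackets><(?:<+\s+)?))"

def parse_codes(paragraph):
    """Parse one paragraph into a tree of strings and (letter, children) nodes.

    Returns (tree, warnings).
    """
    root = []
    stack = [(None, 0, root)]      # frames: (letter, bracket width, children)
    warnings = []
    pos = 0
    while True:
        letter, width, children = stack[-1]
        if width == 0:
            close = r"(?!)"                  # top level: nothing to close
        elif width == 1:
            close = r">"                     # single-bracket code
        else:
            close = r"\s+>{%d}" % width      # whitespace plus matching ">"s
        pattern = re.compile(OPEN + "|(?P<close>" + close + ")")
        m = pattern.search(paragraph, pos)
        if m is None:
            break
        if m.start() > pos:
            children.append(paragraph[pos:m.start()])
        pos = m.end()
        if m.group("open"):
            new_width = m.group("brackets").count("<")
            stack.append((m.group("open")[0], new_width, []))
        else:
            stack.pop()
            stack[-1][2].append((letter, children))
    if pos < len(paragraph):
        stack[-1][2].append(paragraph[pos:])
    # Codes never span paragraphs: close whatever is still open.
    while len(stack) > 1:
        letter, _, children = stack.pop()
        warnings.append("Unterminated %s code" % letter)
        stack[-1][2].append((letter, children))
    return root, warnings
```

Because the closing pattern depends only on how the innermost open code was written, a ``>'' inside ``C<< ... >>'' stays literal, while in ``C<$foo->bar>'' the first ``>'' closes the code.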


Notes on Implementing Pod Processors

The following is a long section of miscellaneous requirements and suggestions to do with Pod processing.


About L<...> Codes

As you can tell from a glance at perlpod, the L<...> code is the most complex of the Pod formatting codes. The points below will hopefully clarify what it means and how processors should deal with it.


About =over...=back Regions

``=over''...``=back'' regions are used for various kinds of list-like structures. (I use the term ``region'' here simply as a collective term for everything from the ``=over'' to the matching ``=back''.)


About Data Paragraphs and ``=begin/=end'' Regions

Data paragraphs are typically used for inlining non-Pod data that is to be used (typically passed through) when rendering the document to a specific format:

  =begin rtf
  \par{\pard\qr\sa4500{\i Printed\~\chdate\~\chtime}\par}
  =end rtf

The exact same effect could, incidentally, be achieved with a single ``=for'' paragraph:

  =for rtf \par{\pard\qr\sa4500{\i Printed\~\chdate\~\chtime}\par}

(Although that is not formally a data paragraph, it has the same meaning as one, and Pod parsers may parse it as one.)
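The equivalence between a ``=for'' paragraph and a ``=begin''...``=end'' region can be sketched in Python as a simple rewrite. The function name expand_for is hypothetical, and the sketch works on the raw text of a single paragraph.

```python
def expand_for(paragraph_text):
    """Rewrite '=for NAME TEXT' as the equivalent =begin/=end paragraphs."""
    parts = paragraph_text.split(None, 2)
    if len(parts) < 2 or parts[0] != "=for":
        raise ValueError("not a =for paragraph with a formatname")
    name = parts[1]
    result = ["=begin " + name]
    if len(parts) == 3:            # the text is optional
        result.append(parts[2])
    result.append("=end " + name)
    return result
```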

Another example of a data paragraph:

  =begin html
  I like <em>PIE</em>!
  <hr>Especially pecan pie!
  =end html

If these were ordinary paragraphs, the Pod parser would try to expand the ``E</em>'' (in the first paragraph) as a formatting code, just like ``E<lt>'' or ``E<eacute>''. But since this is in a ``=begin identifier''...``=end identifier'' region and the identifier ``html'' doesn't have a ``:'' prefix, the contents of this region are stored as data paragraphs, instead of being processed as ordinary paragraphs (or if they began with spaces and/or tabs, as verbatim paragraphs).

As a further example: At time of writing, no ``biblio'' identifier is supported, but suppose some processor were written to recognize it as a way of (say) denoting a bibliographic reference (necessarily containing formatting codes in ordinary paragraphs). The fact that ``biblio'' paragraphs were meant for ordinary processing would be indicated by prefacing each ``biblio'' identifier with a colon:

  =begin :biblio
  Wirth, Niklaus.  1976.  I<Algorithms + Data Structures =
  Programs.>  Prentice-Hall, Englewood Cliffs, NJ.
  =end :biblio

This would signal to the parser that paragraphs in this begin...end region are subject to normal handling as ordinary/verbatim paragraphs (while still tagged as meant only for processors that understand the ``biblio'' identifier). The same effect could be had with:

  =for :biblio
  Wirth, Niklaus.  1976.  I<Algorithms + Data Structures =
  Programs.>  Prentice-Hall, Englewood Cliffs, NJ.

The ``:'' on these identifiers means simply ``process this stuff normally, even though the result will be for some special target''. I suggest that parser APIs report ``biblio'' as the target identifier, but also report that it had a ``:'' prefix. (And similarly, with the above ``html'', report ``html'' as the target identifier, and note the lack of a ``:'' prefix.)

Note that a ``=begin identifier''...``=end identifier'' region where identifier begins with a colon can contain commands. For example:

  =begin :biblio
  Wirth's classic is available in several editions, including:
  =for comment
   hm, check abebooks.com for how much used copies cost.
  =over
  =item
  Wirth, Niklaus.  1975.  I<Algorithmen und Datenstrukturen.>
  Teubner, Stuttgart.  [Yes, it's in German.]
  =item
  Wirth, Niklaus.  1976.  I<Algorithms + Data Structures =
  Programs.>  Prentice-Hall, Englewood Cliffs, NJ.
  =back
  =end :biblio

Note, however, that a ``=begin identifier''...``=end identifier'' region where identifier does not begin with a colon should not directly contain ``=head1'' ... ``=head4'' commands, nor ``=over'', nor ``=back'', nor ``=item''. For example, this may be considered invalid:

  =begin somedata
  This is a data paragraph.
  =head1 Don't do this!
  This is a data paragraph too.
  =end somedata

A Pod processor may signal that the above (specifically the ``=head1'' paragraph) is an error. Note, however, that the following should not be treated as an error:

  =begin somedata
  This is a data paragraph.
  =cut
  # Yup, this isn't Pod anymore.
  sub excl { (rand() > .5) ? "hoo!" : "hah!" }
  =pod
  This is a data paragraph too.
  =end somedata

And this too is valid:

  =begin someformat
  This is a data paragraph.
    And this is a data paragraph.
  =begin someotherformat
  This is a data paragraph too.
    And this is a data paragraph too.
  =begin :yetanotherformat
  =head2 This is a command paragraph!
  This is an ordinary paragraph!
    And this is a verbatim paragraph!
  =end :yetanotherformat
  =end someotherformat
  Another data paragraph!
  =end someformat

The contents of the above ``=begin :yetanotherformat'' ... ``=end :yetanotherformat'' region aren't data paragraphs, because the immediately containing region's identifier (``:yetanotherformat'') begins with a colon. In practice, most regions that contain data paragraphs will contain only data paragraphs; however, the above nesting is syntactically valid as Pod, even if it is rare. However, the handlers for some formats, like ``html'', will accept only data paragraphs, not nested regions; and they may complain if they see (targeted for them) nested regions, or commands, other than ``=end'', ``=pod'', and ``=cut''.

Also consider this valid structure:

  =begin :biblio
  Wirth's classic is available in several editions, including:
  =over
  =item
  Wirth, Niklaus.  1975.  I<Algorithmen und Datenstrukturen.>
  Teubner, Stuttgart.  [Yes, it's in German.]
  =item
  Wirth, Niklaus.  1976.  I<Algorithms + Data Structures =
  Programs.>  Prentice-Hall, Englewood Cliffs, NJ.
  =back
  Buy buy buy!
  =begin html
  <img src='wirth_spokesmodeling_book.png'>

  <hr>
  =end html
  Now now now!
  =end :biblio

There, the ``=begin html''...``=end html'' region is nested inside the larger ``=begin :biblio''...``=end :biblio'' region. Note that the content of the ``=begin html''...``=end html'' region is data paragraph(s), because the immediately containing region's identifier (``html'') doesn't begin with a colon.

Pod parsers, when processing a series of data paragraphs one after another (within a single region), should consider them to be one large data paragraph that happens to contain blank lines. So the content of the above ``=begin html''...``=end html'' may be stored as two data paragraphs (one consisting of ``<img src='wirth_spokesmodeling_book.png'>\n'' and another consisting of ``<hr>\n''), but should be stored as a single data paragraph (consisting of ``<img src='wirth_spokesmodeling_book.png'>\n\n<hr>\n'').
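The recommended merging can be shown in one line of Python. The function name is hypothetical, and each paragraph is assumed to be stored with its trailing newline, as in the strings quoted above.

```python
def merge_data_paragraphs(paragraphs):
    """Join consecutive data paragraphs (each ending in a newline) into one."""
    # Joining with a newline restores the blank line between paragraphs.
    return "\n".join(paragraphs)
```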

Pod processors should tolerate empty ``=begin something''...``=end something'' regions, empty ``=begin :something''...``=end :something'' regions, and contentless ``=for something'' and ``=for :something'' paragraphs. I.e., these should be tolerated:

  =for html
  =begin html
  =end html
  =begin :biblio
  =end :biblio

Incidentally, note that there's no easy way to express a data paragraph starting with something that looks like a command. Consider:

  =begin stuff
  =shazbot
  =end stuff

There, ``=shazbot'' will be parsed as a Pod command ``shazbot'', not as a data paragraph ``=shazbot\n''. However, you can express a data paragraph consisting of ``=shazbot\n'' using this code:

  =for stuff =shazbot

The situation where this is necessary is presumably quite rare.

Note that =end commands must match the currently open =begin command. That is, they must properly nest. For example, this is valid:

  =begin outer
  X
  =begin inner
  Y
  =end inner
  Z
  =end outer

while this is invalid:

  =begin outer
  X
  =begin inner
  Y
  =end outer
  Z
  =end inner

This latter is improper because when the ``=end outer'' command is seen, the currently open region has the formatname ``inner'', not ``outer''. (It just happens that ``outer'' is the format name of a higher-up region.) This is an error. Processors must by default report this as an error, and may halt processing the document containing that error. A corollary of this is that regions cannot ``overlap'' -- i.e., the latter block above does not represent a region called ``outer'' which contains X and Y, overlapping a region called ``inner'' which contains Y and Z. But because it is invalid (as all apparently overlapping regions would be), it doesn't represent that, or anything at all.

Similarly, this is invalid:

  =begin thing
  =end hting

This is an error because the region is opened by ``thing'', and the ``=end'' tries to close ``hting'' [sic].

This is also invalid:

  =begin thing
  =end

This is invalid because every ``=end'' command must have a formatname parameter.
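The nesting rules above amount to a stack discipline, which the following Python sketch illustrates. The function name check_regions and its input shape (a list of (command, formatname) pairs) are hypothetical. The sketch collects error messages rather than halting, which is one of the behaviours the text above allows.

```python
def check_regions(commands):
    """Check =begin/=end nesting; return a list of error messages."""
    stack, errors = [], []
    for command, formatname in commands:
        if command == "begin":
            stack.append(formatname)
        elif command == "end":
            if not formatname:
                errors.append("=end without a formatname")
            elif not stack:
                errors.append("=end %s without an open region" % formatname)
            elif stack[-1] != formatname:
                # Only the most recently opened region may be closed.
                errors.append(
                    "=end %s does not match open =begin %s" % (formatname, stack[-1])
                )
            else:
                stack.pop()
    return errors
```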


SEE ALSO

the perlpod manpage, PODs: Embedded Documentation in the perlsyn manpage, podchecker


AUTHOR

Sean M. Burke
