WWW::Publish - Web content maintenance for the easily bored
Note: this module was previously known as WebSite::WYSIGASP. The name has changed as the first step of a merger with a module that Paul Lindner announced. (It's a more sensible name, but I kind of liked ``WYSIGASP''.)
The focus of many Perl tools like embperl, ePerl, HTML::Template, etc. is mainly on the generation of synthesized documents at delivery time. However, in many cases sample pages already exist that were created with WYSIWYG tools such as HoTMetaL, FrontPage or Dreamweaver. Non-techies prefer to create their web pages using such WYSIWYG tools, but these pages often lack consistency, and the act of retrofitting Perl code into them is arduous, even with embperl, HTML::Template and the like. Building pages with Perl, even with CGI.pm or other such modules, can be tedious and frustrating, especially if the client decides they want to change everything around.
Having been faced with this scenario one too many times I came up with the concept of WYSIGASP (What You See Is Good As a Starting Point) and this module, WWW::Publish, which lets you build web sites from existing pages, page fragments, bits of Perl, etc.
With WYSIGASP you let the client or their designers build the web site using whatever tools they like on a separate virtual host. Then a batch Perl program reads the pages, parses the HTML and performs various edits to create a new web site. Pages with embedded code can be turned into Perl functions that, when called, generate the HTML with the embedded code executed (a standalone program, html2perl, is provided that will convert a single HTML page to a Perl module).
Preprocessing the entire web site makes it possible to enforce a consistent style (see the example below). Generally the generated web site should be regarded as a 'review' site, which will need to be checked before it goes live (going live could well be done by just switching directories and signalling the server).
Preprocessing a web site has a couple of added benefits. Most web pages contain redundant tags (especially META tags inserted by HTML editors), whitespace and comments. Stripping these out ahead of delivery time reduces the file size, which reduces download times and server load. Some server side include directives, such as automatically including the file's last modified time, or other static files, can also be resolved at this time, again reducing the amount of processing the web server has to do. The module could even automate some of the checks such as link checking, checking for orphan files, etc (such functionality may well be added).
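The module's own optimizer is not shown in this document, but the idea of stripping comments and redundant whitespace ahead of delivery time can be sketched with HTML::Parser (one of the listed prerequisites). This is an illustrative sketch only, not WWW::Publish's actual implementation:

    #!/usr/bin/perl
    # Sketch only: strip comments and collapse whitespace runs in HTML,
    # leaving tags untouched.  Not the actual WWW::Publish optimizer.
    use strict;
    use HTML::Parser;

    sub optimize_html {
        my $html = shift;
        my $out  = '';
        my $p = HTML::Parser->new(
            api_version => 3,
            default_h   => [ sub { $out .= shift }, 'text' ],   # pass tags through
            text_h      => [ sub { my $t = shift;
                                   $t =~ s/\s+/ /g;             # collapse whitespace
                                   $out .= $t }, 'text' ],
            comment_h   => [ sub { } ],                         # drop comments
        );
        $p->parse($html);
        $p->eof;
        return $out;
    }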
WARNING: development of this module is still at an early stage. I would be very grateful for comments, constructive criticism and suggestions. Suggestions for a better name are welcome; or do people like the tongue-in-cheek acronyms that Perl seems to be associated with?
The following example program shows how WWW::Publish can be used to publish a ``work-in-progress'' web to a ``review'' web, while ensuring that every page has a consistent footer:
    #!/usr/bin/perl

    use WWW::Publish;
    use WWW::Publish::Content::Page;

    my $srcdir = '/web/wip/htdocs';
    my $dstdir = '/web/review/htdocs';
    my $footer = WWW::Publish::Content::Page
                     ->open('/web/wip/templates/footer.html')
                     ->element(id => 'footer');

    my $site = WWW::Publish->open(srcdir => $srcdir, dstdir => $dstdir);

    while (my $page = $site->next_page) {
        if ($page->mime_type eq 'text/html') {
            if ($page->contains(id => 'footer')) {
                $page->replace(id => 'footer', value => $footer);
            }
            else {
                print(STDERR $page->filename,
                      ": does not have a marked footer element\n");
            }
        }
    }
    exit(0);
The source HTML files would contain something like:
<DIV ID=footer>standard footer goes here</DIV>
A handle onto each page is returned by the ``next_page'' method. When the handle goes out of scope its destructor is called, which copies or writes the file (if a destination directory was specified when the site was opened). In this example only pages of MIME type ``text/html'' are updated.
By default the module optimizes HTML files by removing redundant whitespace and comments, so even a null loop can be useful, i.e.
1 while($site->next_page);
In fact the site destructor implicitly performs such a loop unless explicitly told that processing has been finished, so the following four line program will copy a web-site, optimizing any HTML files it finds:
    #!/usr/bin/perl
    use WWW::Publish;
    WWW::Publish->open(srcdir => '/web/wip/htdocs',
                       dstdir => '/web/review/htdocs');
Perl code can be embedded in or inserted into a web page enclosed in ASP delimiters. Such pages can be converted into perl modules.
For example given the file:
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
    <html>
    <head>
    <title>Testing</title>
    </head>
    <body bgcolor="white">
    <h1>A sample HTML document</h1>
This is a sample HTML document to demonstrate WWW::Publish.
<p>It contains some embedded code using ASP syntax:
<p><% print "the answer is $answer" %>
<!-- and a comment to be stripped -->
</body> </html>
The module can generate output equivalent to the following (reformatted here for readability):

    package TestDocs;

    sub sample {
        my $answer = shift;
        print "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML//EN\"><HTML> <HEAD> ",
              "<TITLE>Testing</TITLE> </HEAD> <BODY BGCOLOR=\"white\"> ",
              "<H1>A sample HTML document</H1> This is a sample HTML document ",
              "to demonstrate WWW::Publish. <P>It contains some embedded code: <P>";
        print "the answer is $answer";
        print " </BODY> </HTML>\n";
    }
1;
This text could be written to a file TestDocs.pm, which could then be ``use''d by a CGI script and the function sample invoked to generate the complete HTML document (where $answer is 42, of course):
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"><HTML> <HEAD> <TITLE>Testing</TITLE> </HEAD> <BODY BGCOLOR="white"> <H1>A sample HTML document</H1> This is a sample HTML document to demonstrate WWW::Publish. <P>It contains some embedded code: <P>the answer is 42 </BODY> </HTML>
Again the lines have been broken for display; there were no newlines in the output. The output size has come down from 384 to 284 bytes, a saving of 26%.
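The transformation that html2perl performs can be illustrated by splitting the page on the ASP delimiters. This is a sketch of the idea only, not the actual html2perl implementation (the hard-coded ``my $answer = shift'' argument list simply mirrors the example above):

    #!/usr/bin/perl
    # Sketch of the ASP-to-Perl transformation; not the actual html2perl code.
    use strict;

    sub html_to_sub {
        my ($name, $html) = @_;
        my $code = "sub $name {\n    my \$answer = shift;\n";
        # Alternate between literal HTML and embedded <% ... %> code chunks.
        for my $chunk (split /(<%.*?%>)/s, $html) {
            next unless length $chunk;
            if ($chunk =~ /^<%(.*)%>$/s) {
                $code .= "    $1;\n";            # embedded code runs as-is
            }
            else {
                $chunk =~ s/(["\\\@\$])/\\$1/g;  # escape for a double-quoted string
                $code .= "    print \"$chunk\";\n";
            }
        }
        return $code . "}\n";
    }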
WWW::Publish is available from http://www.ford-mason.co.uk/resources
Perl version 5.005
an ANSI C compiler (tested with gcc 2.7.2.3 on Red Hat Linux 5.2)
In addition the following Perl modules from CPAN are needed:
File::MMagic
HTML::Element
HTML::Entities
HTML::Parser
HTML::TreeBuilder
IO::File
Parse::ePerl
Test
Installation is performed with the standard Perl mantra:
    perl Makefile.PL
    make
    make test
    make install
You will need the Test module to run the tests. This module is standard with Perl 5.005 and is also included in 5.004_05.
The first thing is to settle on an acceptable interface and iron out any problems with particular HTML constructs.
There is some functionality I would like to provide, but I am not sure how it should be done. For example, you can locate an element using the ID attribute, but it would be useful to be able to address elements that have no identifier. Suppose you have a document with a mock-up of a table that doesn't have an ID but does contain sample data, and you want to replace all the rows from the second to the penultimate with some embedded Perl code; perhaps this could be expressed as:
    $html->replace(element => 'document.table(1)',
                   what    => '.row(2..$-1)',
                   value   => '<% generate_rows($self) %>');
Currently the 'what' argument is limited to ATTRIBUTE, CONTENT or ELEMENT and elements must be identified by their ID, by saying ``id => $id''.
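For example, the three supported targets might be used as follows. The document only names the three 'what' values, so the extra ``attribute'' parameter in the last call is my guess at the intended usage, not a documented part of the interface:

    # Sketches of the three currently supported 'what' targets; argument
    # details beyond those shown in this document are guesses.
    $page->replace(id => 'footer', what => 'ELEMENT',
                   value => $footer);                     # replace the whole element
    $page->replace(id => 'footer', what => 'CONTENT',
                   value => 'new footer text');           # replace just its content
    $page->replace(id => 'footer', what => 'ATTRIBUTE',
                   attribute => 'class', value => 'std'); # replace an attribute value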
If you have any ideas about such syntax, please contact me.
The development of the WWW::Publish modules is still at an early stage and the focus is on getting the core functionality working correctly. Future enhancements may include:
performance tuning
logging of actions taken
remapping sections of the web
optional intra-site link checking
optional HTML validation (both of entire documents and fragments as they are manipulated)
other optional file filtering (e.g. compressing)
pattern substitution in text files
parsing of web server configuration files to determine defaults
further optimization of HTML files
optional linking files rather than copying
And this doesn't even touch on XML!
Version 0.1 released on 1 July 1999
Developing this package would not have been so straightforward had it not been for the existence of Gisle Aas's HTML::Parser and related modules, and of course Perl itself. My copies of Perl Cookbook (by Tom Christiansen and Nathan Torkington, published by O'Reilly & Associates) and Effective Perl Programming (by Joseph N. Hall with Randal L. Schwartz, published by Addison Wesley) have both become very well thumbed. I recommend both books, as well as the Perl bible.
Andrew Ford <[email protected]>
Copyright (C) 1999, Ford & Mason Ltd. All rights reserved. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the ``Artistic License'' or the ``GNU General Public License''.
This software comes with absolutely NO WARRANTY of any kind, i.e.:
IN NO EVENT SHALL THE AUTHORS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
THE AUTHORS SPECIFICALLY DISCLAIM ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE IS PROVIDED ON AN ``AS IS'' BASIS, AND THE AUTHORS HAVE NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.