XML stands for Extensible Markup Language, and belongs to the same family of technologies as XSLT, XPath, XLink, WDDX and so on.

XML uses a tag-based structure, much as HTML does. XML also lets you define your own tags.

Data in XML format needs to be transformed into something more readable. This is the job of XSL, the Extensible Stylesheet Language.

Most browsers do not have built-in XML parsers or XSL processors.

The solution is to write an intermediate layer between the client and the server, which parses the XML and returns readable output. This is where Perl comes in: it supports XML parsing (the DOM and XML packages), and can perform XSL transformations (via the Sablotron processor).

Perl offers two approaches to parsing XML.

The first of these approaches is SAX, the Simple API for XML. A SAX parser works by traversing an XML document and calling specific functions as it encounters different types of tags. For example, I might call a specific function to process a starting tag, another function to process an ending tag, and a third function to process the data between them.

The parser's responsibility is simply to parse the document; the functions it calls are responsible for processing the tags found. Once the tag is processed, the parser moves on to the next element in the document, and the process repeats itself.

Perl has an event-driven, SAX-style parser based on the expat library created by James Clark; it's implemented as a Perl package named XML::Parser, and currently maintained by Clark Cooper. It may not be installed with your copy of Perl; if you don't already have it, you should download and install it before proceeding further. You can get a copy from http://wwwx.netheaven.com/~coopercc/xmlparser/, or from CPAN (http://www.cpan.org/).

I'll begin by putting together a simple XML file:

<?xml version="1.0"?>

<library>
   <book>
      <title>Dreamcatcher</title>
      <author>Stephen King</author>
      <genre>Horror</genre>
      <pages>899</pages>
      <price>23.99</price>
      <rating>5</rating>
   </book>

   <book>
      <title>Mystic River</title>
      <author>Dennis Lehane</author>
      <genre>Thriller</genre>
      <pages>390</pages>
      <price>17.49</price>
      <rating>4</rating>
   </book>

   <book>
      <title>The Lord Of The Rings</title>
      <author>J. R. R. Tolkien</author>
      <genre>Fantasy</genre>
      <pages>3489</pages>
      <price>10.99</price>
      <rating>5</rating>
   </book>

</library>

Once my data is in XML-compliant format, I need to decide what I'd like the final output to look like.

Let's say I want it to look like this:

Output image

As you can see, this is a simple table containing columns for the book title, author, price and rating. (I'm not using all the information in the XML file). The title of the book is printed in italics, while the numerical rating is converted into something more readable.

Next, I'll write some Perl code to take care of this for me.

The first order of business is to initialize the XML parser, and set up the callback functions.

#!/usr/bin/perl

# include package
use XML::Parser;

# initialize parser
$xp = new XML::Parser();

# set callback functions
$xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);

# parse XML
$xp->parsefile("library.xml");

The parser is initialized in the ordinary way - by instantiating a new object of the Parser class. This object is assigned to the variable $xp, and is used in subsequent function calls.

# initialize parser
$xp = new XML::Parser();

The next step is to specify the functions to be executed when the parser encounters the opening and closing tags of an element. The setHandlers() method is used to specify these functions; it accepts a hash of values, with keys containing the events to watch out for, and values indicating which functions to trigger.

# set callback functions
$xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);

In this case, the user-defined functions start() and end() are called when starting and ending element tags are encountered, while character data triggers the cdata() function.

Obviously, these aren't the only types of events a parser can be set up to handle - the XML::Parser package allows you to specify handlers for a diverse array of events; I'll discuss these briefly a little later.

The next step in the script above is to open the XML file, read it and parse it via the parsefile() method. The parsefile() method will iterate through the XML document, calling the appropriate handling function each time it encounters a specific data type.

# parse XML
$xp->parsefile("library.xml");

In case your XML data is not stored in a file, but in a string variable - quite likely if, for example, you've generated it dynamically from a database - you can replace the parsefile() method with the parse() method, which accepts a string variable containing the XML document, rather than a filename.
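
As a minimal sketch of the parse() variant, the snippet below feeds an in-memory string to the parser instead of a filename. The XML string, the @titles array and the $inTitle flag are made up for illustration; only new(), setHandlers() and parse() come from the XML::Parser package.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# illustrative XML held in a string, as if it had been built from a database
my $xml = '<?xml version="1.0"?>'
        . '<library>'
        . '<book><title>Dreamcatcher</title></book>'
        . '<book><title>Mystic River</title></book>'
        . '</library>';

# titles collected while parsing (hypothetical helper variables)
my @titles;
my $inTitle = 0;

my $xp = XML::Parser->new();
$xp->setHandlers(
   Start => sub {
      my ($parser, $name) = @_;
      if ($name eq "title") { $inTitle = 1; push @titles, ""; }
   },
   End => sub {
      my ($parser, $name) = @_;
      $inTitle = 0 if $name eq "title";
   },
   Char => sub {
      my ($parser, $data) = @_;
      $titles[-1] .= $data if $inTitle;
   },
);

# parse() takes the document itself rather than a filename
$xp->parse($xml);

print join(", ", @titles), "\n";
```

The handlers are the same kind of callbacks used with parsefile(); only the final call changes.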

Once the document has been completely parsed, the script will proceed to the next line (if there is one), or terminate gracefully. A parse error - for example, a mismatched tag or a badly-nested element - will cause the script to die immediately.
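
If dying on the first error is not what you want (in a CGI script, for example), one common Perl idiom is to wrap the call in an eval block. The sketch below assumes a hypothetical helper named parse_error(), which is not part of XML::Parser; it simply returns the parser's error message, or an empty string if the document is well-formed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# hypothetical helper: returns "" on success, or the error message on failure
sub parse_error
{
   my ($xml) = @_;
   my $xp = XML::Parser->new();

   # parse() dies on the first error; eval traps the exception
   my $ok = eval { $xp->parse($xml); 1 };
   return "" if $ok;
   return $@ || "unknown parse error";
}

my $good = "<a><b>text</b></a>";
my $bad  = "<a><b>text</a></b>";   # mismatched closing tags

foreach my $xml ($good, $bad)
{
   my $err = parse_error($xml);
   if ($err eq "")
   {
      print "well-formed\n";
   }
   else
   {
      print "parse failed: $err\n";
   }
}
```

This way the script can print a friendly message, or fall back to cached output, instead of terminating halfway through a page.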

As you can see, this is fairly simple - simpler, in fact, than the equivalent process in other languages like PHP or Java. Don't get worried, though - this simplicity conceals a fair amount of power.

As I've just explained, the start(), end() and cdata() functions will be called by the parser as it progresses through the document. We haven't defined these yet - let's do that next:

# keep track of which tag is currently being processed
$currentTag = "";

# this is called when a start tag is found
sub start()
{
   # extract variables
   my ($parser, $name, %attr) = @_;

   $currentTag = lc($name);

   if ($currentTag eq "book")
   {
      print "<tr>";
   }
   elsif ($currentTag eq "title")
   {
      print "<td>";
   }
   elsif ($currentTag eq "author")
   {
      print "<td>";
   }
   elsif ($currentTag eq "price")
   {
      print "<td>";
   }
   elsif ($currentTag eq "rating")
   {
      print "<td>";
   }

}

Each time the parser encounters a starting tag, it calls start() with the name of the tag (and attributes, if any) as arguments. The start() function then processes the tag, printing corresponding HTML markup in place of the XML tag.

I've used an "if" statement, keyed on the tag name, to decide how to process each tag. For example, since I know that <book> indicates the beginning of a new row in my desired output, I replace it with a <tr>, while other elements like <title> and <author> correspond to table cells, and are replaced with <td> tags.

In case you're wondering, I've used the lc() function to convert the tag name to lowercase before performing the comparison; this lets the script tolerate XML documents that use upper-case or mixed-case tags. Bear in mind that XML itself is case-sensitive, so <Book> and <book> are technically different elements; lowercasing here is a convenience, not a requirement.

Finally, I've also stored the current tag name in the global variable $currentTag - this can be used to identify which tag is being processed at any stage, and it'll come in useful a little further down.

The end() function takes care of closing tags, and looks similar to start() - note that I've specifically cleaned up $currentTag at the end.

# this is called when an end tag is found
sub end()
{
   my ($parser, $name) = @_;
   $currentTag = lc($name);
   if ($currentTag eq "book")
   {
      print "</tr>";
   }
   elsif ($currentTag eq "title")
   {
      print "</td>";
   }
   elsif ($currentTag eq "author")
   {
      print "</td>";
   }
   elsif ($currentTag eq "price")
   {
      print "</td>";
   }
   elsif ($currentTag eq "rating")
   {
      print "</td>";
   }

   # clear value of current tag
   $currentTag = "";
}

Note that empty elements generate both start and end events.

So this takes care of replacing XML tags with corresponding HTML tags...but what about handling the data between them?

# this is called when CDATA is found
sub cdata()
{
   my ($parser, $data) = @_;
   my @ratings = ("Words fail me!", "Terrible", "Bad", "Indifferent", "Good", "Excellent");

   if ($currentTag eq "title")
   {
      print "<i>$data</i>";
   }
   elsif ($currentTag eq "author")
   {
      print $data;
   }
   elsif ($currentTag eq "price")
   {
      print "\$$data";
   }
   elsif ($currentTag eq "rating")
   {
      print $ratings[$data];
   }

}

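The cdata() function is called whenever the parser encounters character data between an XML tag pair. Note, however, that the function is only passed the parser handle and the data as arguments; the name of the surrounding tag is not among them. However, since the parser processes XML chunk-by-chunk, we can use the $currentTag variable to identify which tag this data belongs to. Keep in mind that the parser may deliver the text of a single element in more than one call, and that it also reports the whitespace between elements; the latter is ignored here because $currentTag does not match any of the branches at that point.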

Depending on the value of $currentTag, an "if" statement is used to print data with appropriate formatting; this is the place where I add italics to the title, a currency symbol to the price, and a text rating (corresponding to a numerical index) from the @ratings array.
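
As an alternative to a global variable, the parser handle passed to every callback is an XML::Parser::Expat object, and its current_element() method reports the innermost open element. The sketch below is one way this might be used; the %text hash and the sample document are assumptions for illustration. It also appends to a buffer rather than printing straight away, since the character data of one element can arrive in several pieces.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# illustrative document; the entity makes it likely that the title text
# reaches the handler in more than one call
my $xml = '<book><title>Tom &amp; Jerry</title><price>9.99</price></book>';

# accumulated character data, keyed by the enclosing element name
my %text;

my $xp = XML::Parser->new();
$xp->setHandlers(
   Char => sub {
      my ($parser, $data) = @_;

      # ask the parser which element encloses this data
      my $tag = $parser->current_element();
      $text{$tag} .= $data if defined $tag;
   },
);
$xp->parse($xml);

print "$text{title} costs \$$text{price}\n";
```

Buffering like this is only a sketch; a real script would usually flush the buffer in the End handler, once the element is known to be complete.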

Here's what the finished script (with some additional HTML, so that you can use it via CGI) looks like:

#!/usr/bin/perl

# include package
use XML::Parser;

# initialize parser
$xp = new XML::Parser();

# set callback functions
$xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);

# keep track of which tag is currently being processed
$currentTag = "";

# send standard header to browser
print "Content-Type: text/html\n\n";

# set up HTML page
print "<html><head></head><body>";
print "<h2>The Library</h2>";
print "<table border=1 cellspacing=1 cellpadding=5>";
print "<tr><td align=center>Title</td><td align=center>Author</td><td align=center>Price</td><td align=center>User Rating</td></tr>";

# parse XML
$xp->parsefile("library.xml");

print "</table></body></html>";

# this is called when a start tag is found
sub start()
{
   # extract variables
   my ($parser, $name, %attr) = @_;

   $currentTag = lc($name);

   if ($currentTag eq "book")
   {
      print "<tr>";
   }
   elsif ($currentTag eq "title")
   {
      print "<td>";
   }
   elsif ($currentTag eq "author")
   {
      print "<td>";
   }
   elsif ($currentTag eq "price")
   {
      print "<td>";
   }
   elsif ($currentTag eq "rating")
   {
      print "<td>";
   }

}

# this is called when CDATA is found
sub cdata()
{
   my ($parser, $data) = @_;
   my @ratings = ("Words fail me!", "Terrible", "Bad", "Indifferent", "Good", "Excellent");

   if ($currentTag eq "title")
   {
      print "<i>$data</i>";
   }
   elsif ($currentTag eq "author")
   {
      print $data;
   }
   elsif ($currentTag eq "price")
   {
      print "\$$data";
   }
   elsif ($currentTag eq "rating")
   {
      print $ratings[$data];
   }

}

# this is called when an end tag is found
sub end()
{
   my ($parser, $name) = @_;
   $currentTag = lc($name);
   if ($currentTag eq "book")
   {
      print "</tr>";
   }
   elsif ($currentTag eq "title")
   {
      print "</td>";
   }
   elsif ($currentTag eq "author")
   {
      print "</td>";
   }
   elsif ($currentTag eq "price")
   {
      print "</td>";
   }
   elsif ($currentTag eq "rating")
   {
      print "</td>";
   }

   # clear value of current tag
   $currentTag = "";
}

# end

And when you run it, here's what you'll see:

Output image

You can now add new items to your XML document, or edit existing items, and your rendered HTML page will change accordingly. By separating the data from the presentation, XML has imposed standards on data collections, making it possible, for example, for users with no technical knowledge of HTML to easily update content on a Web site, or to present data from a single source in different ways.

In addition to elements and CDATA, Perl also allows you to set up handlers for other types of XML structures, most notably PIs, entities and notations (if you don't know what these are, you might want to skip this section and jump straight into another, more complex example further down). As demonstrated in the previous example, handlers for these structures are set up by specifying appropriate callback functions via a call to the setHandlers() object method.

Here's a quick list of the types of events that the parser can handle, together with a list of their key names (as expected by the setHandlers() method) and a list of the arguments that the corresponding callback function will receive.

Key        Arguments to callback           Event
------------------------------------------------------------------------
Final      parser handle                   Document parsing completed

Start      parser handle, element name,    Start tag found
           attributes

End        parser handle, element name     End tag found

Char       parser handle, CDATA            CDATA found

Proc       parser handle, PI target,       PI found
           PI data

Comment    parser handle, comment          Comment found

Unparsed   parser handle, entity, base,    Unparsed entity found
           system ID, public ID, notation

Notation   parser handle, notation, base,  Notation found
           system ID, public ID

XMLDecl    parser handle, version,         XML declaration found
           encoding, standalone

ExternEnt  parser handle, base,            External entity found
           system ID, public ID

Default    parser handle, data             Default handler
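
To make the table a little more concrete, here is a small sketch that registers two of the handlers listed above, XMLDecl and Comment, and records what they receive. The sample document and the %decl and @comments variables are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# illustrative document with an XML declaration and two comments
my $xml = '<?xml version="1.0" encoding="UTF-8"?>'
        . '<!-- draft -->'
        . '<note><!-- check spelling --><to>Joe</to></note>';

my %decl;       # details taken from the XML declaration
my @comments;   # text of every comment found

my $xp = XML::Parser->new();
$xp->setHandlers(
   # arguments: parser handle, version, encoding, standalone
   XMLDecl => sub {
      my ($parser, $version, $encoding, $standalone) = @_;
      %decl = (version => $version, encoding => $encoding);
   },
   # arguments: parser handle, comment text
   Comment => sub {
      my ($parser, $comment) = @_;
      push @comments, $comment;
   },
);
$xp->parse($xml);

print "version $decl{version}, encoding $decl{encoding}\n";
print "comment: [$_]\n" foreach @comments;
```

Any event without a registered handler is simply skipped, which is why this script produces no output for the elements themselves.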

As an illustration, consider the following simple XML document,

<?xml version="1.0"?>
<random>
<?perl print rand(); ?>
</random>

in combination with this Perl script to demonstrate how to handle processing instructions (PIs):

#!/usr/bin/perl

# include package
use XML::Parser;

# initialize parser
$xp = new XML::Parser();

# set PI handler
$xp->setHandlers(Proc => \&pih);

# output some HTML
print "Content-Type: text/html\n\n";
print "<html><head></head><body>And the winning number is: ";
$xp->parsefile("pi.xml");
print "</body></html>";

# this is called whenever a PI is encountered
sub pih()
{
   # extract data
   my ($parser, $target, $data) = @_;

   # if Perl command
   if (lc($target) eq "perl")
   {
      # execute it
      eval($data);
   }
}

# end

In this case, the parser knows that it has to call the subroutine pih() whenever it encounters a processing instruction in the XML data; this user-defined pih() function is automatically passed the PI target and the actual command to be executed. Assuming the command is a Perl command - as indicated by the target name - the function passes it on to eval() for execution. Since eval() will run whatever code the document contains, this technique is only appropriate for XML from a source you trust.

Here's another, slightly more complex example using the SAX parser, and one of my favourite meals.

<?xml version="1.0"?>

<recipe>

   <name>Chicken Tikka</name>
   <author>Anonymous</author>
   <date>1 June 1999</date>

   <ingredients>

      <item>
         <desc>Boneless chicken breasts</desc>
         <quantity>2</quantity>
      </item>

      <item>
         <desc>Chopped onions</desc>
         <quantity>2</quantity>
      </item>

      <item>
         <desc>Ginger</desc>
         <quantity>1 tsp</quantity>
      </item>

      <item>
         <desc>Garlic</desc>
         <quantity>1 tsp</quantity>
      </item>

      <item>
         <desc>Red chili powder</desc>
         <quantity>1 tsp</quantity>
      </item>

      <item>
         <desc>Coriander seeds</desc>
         <quantity>1 tsp</quantity>
      </item>

      <item>
         <desc>Lime juice</desc>
         <quantity>2 tbsp</quantity>
      </item>

      <item>
         <desc>Butter</desc>
         <quantity>1 tbsp</quantity>
      </item>
   </ingredients>

   <servings>
   3
   </servings>

   <process>
      <step>Cut chicken into cubes, wash and apply lime juice and salt</step>
      <step>Add ginger, garlic, chili, coriander and lime juice in a separate bowl</step>
      <step>Mix well, and add chicken to marinate for 3-4 hours</step>
      <step>Place chicken pieces on skewers and barbeque</step>
      <step>Remove, apply butter, and barbeque again until meat is tender</step>
      <step>Garnish with lemon and chopped onions</step>
   </process>

</recipe>

This time, my Perl script won't be using an "if" statement when I parse the file above; instead, I'm going to be keying tag names to values in a hash. Each of the tags in the XML file above will be replaced with appropriate HTML markup.

#!/usr/bin/perl

# hash of tag names mapped to HTML markup
# "recipe" => start a new block
# "name" => in a larger font
# "author" => in bold
# "ingredients" => unordered list
# "desc" => list items
# "process" => ordered list
# "step" => list items

%startTags = (
"recipe" => "<hr>",
"name" => "<font size=+2>",
"date" => "<i>(",
"author" => "<b>",
"servings" => "<i>Serves ",
"ingredients" => "<h3>Ingredients:</h3><ul>",
"desc" => "<li>",
"quantity" => "(",
"process" => "<h3>Preparation:</h3><ol>",
"step" => "<li>"
);

# close tags opened above
%endTags = (
"name" => "</font><br>",
"date" => ")</i>",
"author" => "</b>",
"ingredients" => "</ul>",
"quantity" => ")",
"servings" => "</i>",
"process" => "</ol>"
);

# name of XML file
$file = "recipe.xml";

# this is called when a start tag is found
sub start()
{
   # extract variables
   my ($parser, $name, %attr) = @_;

   # lowercase element name
   $name = lc($name);

   # print corresponding HTML
   if ($startTags{$name})
   {
      print $startTags{$name};
   }
}

# this is called when CDATA is found
sub cdata()
{
   my ($parser, $data) = @_;
   print $data;
}

# this is called when an end tag is found
sub end()
{
   my ($parser, $name) = @_;
   $name = lc($name);
   if ($endTags{$name})
   {
      print $endTags{$name};
   }
}

# include package
use XML::Parser;

# initialize parser
$xp = new XML::Parser();

# set callback functions
$xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);

# send standard header to browser
print "Content-Type: text/html\n\n";

# print HTML header
print "<html><head></head><body>";

# parse XML
$xp->parsefile($file);

# print HTML footer
print "</body></html>";

# end

In this case, I've set up two hashes, one for opening tags and one for closing tags. When the parser encounters an XML tag, the corresponding handler looks up the tag name in the appropriate hash to see if it exists as a key. If it does, the corresponding value (HTML markup) is printed. This method does away with the slightly cumbersome branching "if" statements of the previous example, and is easier to read and understand.

Here's the output:

The second approach is the DOM, the Document Object Model. Perl has a DOM parser which is also based on the expat library created by James Clark; it's implemented as a Perl package named XML::DOM, and currently maintained by T. J. Mather. If you don't already have it, you should download and install it before proceeding further; you can get a copy from CPAN (http://www.cpan.org/).

This DOM parser works by reading an XML document and creating objects to represent the different parts of that document. Each of these objects comes with specific methods and properties, which can be used to manipulate and access information about it. Thus, the entire XML document is represented as a "tree" of these objects, with the DOM parser providing a simple API to move between the different branches of the tree.

The parser itself supports all the different structures typically found in an XML document - elements, attributes, namespaces, entities, notations et al - but our focus here will be primarily on elements and the data contained within them. If you're interested in the more arcane aspects of XML - as you will have to be to do anything complicated with the language - the XML::DOM package comes with some truly excellent documentation, which gets installed when you install the package. Make it your friend, and you'll find things considerably easier.

Let's start things off with a simple example:

#!/usr/bin/perl

# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";

# include package
use XML::DOM;

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parse($xml);

# print tree as string
print $doc->toString();

# end

In this case, a new instance of the parser is created and assigned to the variable $xp. This object instance can now be used to parse the XML data via its parse() function:

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parse($xml);

You'll remember the parse() function from the first part of this article - it was used by the SAX parser to parse a string. When you think about it, this isn't really all that remarkable - the XML::DOM package is built on top of the XML::Parser package, and therefore inherits many of the latter's methods.

With that in mind, it follows that the DOM parser should also be able to read an XML file directly, simply by using the parsefile() method, instead of the parse() method:

#!/usr/bin/perl

# XML file
$file = "me.xml";

# include package
use XML::DOM;

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parsefile($file);

# print tree as string
print $doc->toString();

# end

The result of successfully parsing an XML document - whether string or file - is an object representation of the XML document (actually, an instance of the Document class). In the example above, this object is called $doc.

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parsefile($file);

This Document object comes with a bunch of interesting methods - and one of the more useful ones is the toString() method, which returns the current document tree as a string. In the examples above, I've used this method to print the entire document to the console.

# print tree as string
print $doc->toString();

It should be noted that this isn't all that great an example of how to use the toString() method. Most often, this method is used during dynamic XML tree generation, when an XML tree is constructed in memory from a database or elsewhere. In such situations, the toString() method comes in handy to write the final XML tree to a file or send it to a parser for further processing.
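
To illustrate that more typical use, here is a rough sketch which builds a small tree in memory and then serializes it with toString(). The %record hash stands in for a database row and is purely hypothetical; the calls to createElement(), createTextNode(), appendChild() and dispose() are taken from the XML::DOM package, and creating the Document object directly via XML::DOM::Document->new() is assumed to be acceptable for this purpose.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

# illustrative record, standing in for a database row
my %record = (name => "Joe Cool", age => 24, sex => "male");

# build the tree in memory: <me> with one child element per field
my $doc  = XML::DOM::Document->new();
my $root = $doc->createElement("me");
$doc->appendChild($root);

foreach my $field (qw(name age sex))
{
   my $element = $doc->createElement($field);
   $element->appendChild($doc->createTextNode($record{$field}));
   $root->appendChild($element);
}

# serialize the finished tree
my $out = $doc->toString();
print $out, "\n";

# XML::DOM trees contain circular references; dispose() breaks them
$doc->dispose();
```

The string in $out could just as easily be written to a file or handed to another parser.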

The Document object comes with another useful method, one which enables you to gain access to information about the document's XML version and character encoding. It's called the getXMLDecl() method, and it returns yet another object, this one representing the standard XML declaration that appears at the top of every XML document. Take a look:

#!/usr/bin/perl

# create an XML-compliant string
$xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";

# include package
use XML::DOM;

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parse($xml);

# get XML declaration
$decl = $doc->getXMLDecl();

# get XML version
print $decl->getVersion();

# get encoding
print $decl->getEncoding();

# get whether standalone
print $decl->getStandalone();

# end

As you can see, the newly-created XMLDecl object comes with a bunch of object methods of its own. These methods provide a simple way to access the document's XML version, character encoding and standalone status.

Using the Document object, it's also possible to obtain references to other nodes in the XML tree, and manipulate them using standard methods. Since the entire document is represented as a tree, the first step is always to obtain a reference to the tree root, or the outermost document element, and use this as a stepping stone to other, deeper branches. Consider the following example, which demonstrates how to do this:

#!/usr/bin/perl

# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";

# include package
use XML::DOM;

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parse($xml);

# get root node "me"
$root = $doc->getDocumentElement();

# end

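Another option here would be to use the getChildNodes() method, which is a common method available to every single node in the document tree. For this document, the following code snippet is equivalent to the one above; in a document where a comment or DOCTYPE declaration precedes the root element, the first child of the Document object may not be the root element, so getDocumentElement() is the safer choice: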

#!/usr/bin/perl

# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";

# include package
use XML::DOM;

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parse($xml);

# get root node "me"
@children = $doc->getChildNodes();
$root = $children[0];

# end

Note that the getChildNodes() method returns an array of nodes under the current node; each of these nodes is again an object instance of the Node class, and comes with methods to access the node name, type and content. Let's look at that next.

Once you've obtained a reference to a node, a number of other methods become available to help you obtain the name and value of that node, as well as references to parent and child nodes. Take a look:

#!/usr/bin/perl

# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me><name>Joe Cool</name><age>24</age><sex>male</sex></me>";

# include package
use XML::DOM;

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parse($xml);

# get root node
$root = $doc->getDocumentElement();

# get name of root node
# returns "me"
print $root->getNodeName();

# get children as array
@children = $root->getChildNodes();

# this is the "name" element under "me"
# I could also have used $root->getFirstChild() to get here
$firstChild = $children[0];

# returns "name"
print $firstChild->getNodeName();

# returns "1"
print $firstChild->getNodeType();

# now to access the value of the text node under "name"
$text = $firstChild->getFirstChild();

# returns "Joe Cool"
print $text->getData();

# returns "#text"
print $text->getNodeName();

# returns "3"
print $text->getNodeType();

# go back up the tree
# start from the "name" element and get its parent
$parent = $firstChild->getParentNode();

# check the name - it should be "me"
# yes it is!
print $parent->getNodeName();

# end

As you can see, the getNodeName() and getNodeType() methods provide access to basic information about the node currently under examination. The children of this node can be obtained with the getChildNodes() method previously discussed, and node parents can be obtained with the getParentNode() method. It's fairly simple, and - once you play with it a little - you'll get the hang of how it works.

A quick note on the getNodeType() method above: every node is of a specific type, and this method returns a numeric code corresponding to the type (1 for an element node and 3 for a text node, as in the example above). A complete list of defined types is available in the Perl documentation for the XML::DOM package.
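
Rather than comparing against bare numbers, it may be more readable to use the named constants that the XML::DOM package exports, such as ELEMENT_NODE and TEXT_NODE. The following sketch uses ELEMENT_NODE to skip the whitespace text nodes that appear when a document is indented; the sample string and the @elements array are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

# illustrative document; the newlines become text nodes under <me>
my $xml = "<me>\n<name>Joe Cool</name>\n<age>24</age>\n</me>";

my $xp  = XML::DOM::Parser->new();
my $doc = $xp->parse($xml);

# names of the element children of the root, ignoring text nodes
my @elements;
foreach my $node ($doc->getDocumentElement()->getChildNodes())
{
   next unless $node->getNodeType() == ELEMENT_NODE;
   push @elements, $node->getNodeName();
}

print join(",", @elements), "\n";
```

Without the type check, the loop would also visit the "#text" nodes holding the newlines between the elements.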

Note also that the text within an element's opening and closing tags is treated as a child node of the corresponding element node, and is returned as an object. This object comes with a getData() method, which returns the actual content nested within the element's opening and closing tags. You'll see this again a little further on.

Just as it's possible to access elements and their content, it's also possible to access element attributes and their values. The getAttributes() method of the Node object provides access to a list of all available attributes, and the getNamedItem() and getValue() methods make it possible to access specific attributes and their values. Take a look at a demonstration of how it all works:

#!/usr/bin/perl

# create an XML-compliant string
$xml = "<?xml version=\"1.0\"?><me species=\"human\"><name>Joe Cool</name><age>24</age><sex>male</sex></me>";

# include package
use XML::DOM;

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parse($xml);

# get root node (Node object)
$root = $doc->getDocumentElement();

# get attributes (NamedNodeMap object)
$attribs = $root->getAttributes();

# get specific attribute (Attr object)
$species = $attribs->getNamedItem("species");

# get value of attribute
# returns "human"
print $species->getValue();

# end

Getting to an attribute value is a little more complicated than getting to an element. But hey - no gain without pain, right?
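
When only a single value is needed, element nodes in XML::DOM also offer a getAttribute() method that goes straight from the element to the attribute value. A minimal sketch, using a shortened version of the same illustrative document:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

# shortened version of the document used above
my $xml = '<?xml version="1.0"?><me species="human"><name>Joe Cool</name></me>';

my $xp   = XML::DOM::Parser->new();
my $doc  = $xp->parse($xml);
my $root = $doc->getDocumentElement();

# look up the attribute value by name, without going through
# the NamedNodeMap and Attr objects
my $species = $root->getAttribute("species");

print "$species\n";
```

The longer route via getAttributes() remains useful when you need to enumerate every attribute of an element rather than fetch one by name.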

Using this information, it's pretty easy to re-create our first example using the DOM parser. Here's the XML data,

<?xml version="1.0"?>

<library>
   <book>
      <title>Dreamcatcher</title>
      <author>Stephen King</author>
      <genre>Horror</genre>
      <pages>899</pages>
      <price>23.99</price>
      <rating>5</rating>
   </book>

   <book>
      <title>Mystic River</title>
      <author>Dennis Lehane</author>
      <genre>Thriller</genre>
      <pages>390</pages>
      <price>17.49</price>
      <rating>4</rating>
   </book>

   <book>
      <title>The Lord Of The Rings</title>
      <author>J. R. R. Tolkien</author>
      <genre>Fantasy</genre>
      <pages>3489</pages>
      <price>10.99</price>
      <rating>5</rating>
   </book>

</library>

and here's the script which does all the work.

#!/usr/bin/perl

# XML file
$file = "library.xml";

# array of ratings
@ratings = ("Words fail me!", "Terrible", "Bad", "Indifferent", "Good", "Excellent");

# include package
use XML::DOM;

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parsefile($file);

# set up HTML page
print "Content-Type: text/html\n\n";
print "<html><head></head><body>";
print "<h2>The Library</h2>";
print "<table border=1 cellspacing=1 cellpadding=5> <tr> <td align=center>Title</td> <td align=center>Author</td> <td align=center>Price</td> <td align=center>User Rating</td> </tr>";

# get root node
$root = $doc->getDocumentElement();

# get children
@books = $root->getChildNodes();

# iterate through book list
foreach $node (@books)
{
   # if element node
   # (text nodes holding whitespace between books are skipped)
   if ($node->getNodeType() == 1)
   {
      # start a table row for this book
      print "<tr>";

      # get children
      # this is the "title", "author"... level
      @children = $node->getChildNodes();

      # iterate through child nodes
      foreach $item (@children)
      {
         # check element name
         if (lc($item->getNodeName) eq "title")
         {
            # print text node contents under this element
            print "<td><i>" . $item->getFirstChild()->getData . "</i></td>";
         }
         elsif (lc($item->getNodeName) eq "author")
         {
            print "<td>" . $item->getFirstChild()->getData . "</td>";
         }
         elsif (lc($item->getNodeName) eq "price")
         {
            print "<td>\$" . $item->getFirstChild()->getData . "</td>";
         }
         elsif (lc($item->getNodeName) eq "rating")
         {
            $num = $item->getFirstChild()->getData;
            print "<td>" . $ratings[$num] . "</td>";
         }
      }

      # close the table row
      print "</tr>";
   }
}

print "</table></body></html>";

# end

This may appear complex, but it isn't really all that hard to understand. I've first obtained a reference to the root of the document tree, $root, and then to the children of that root node; these children are returned as a regular Perl array. I've then used a "foreach" loop to iterate through the array, navigate to the next level, and print the content found in the nodes, with appropriate formatting. The numerous "if" statements you see are needed to check the name of each node and then add appropriate HTML formatting to it.

As explained earlier, the data itself is treated as a child text node of the corresponding element node. Therefore, whenever I find an element node, I've used the node's getFirstChild() method to access the text node under it, and the getData() method to extract the data from that text node.
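
If you only care about one kind of element, walking the tree level by level is not the only option: XML::DOM also provides getElementsByTagName(), which collects matching elements from anywhere below a node. The sketch below is a simplified variation on the script above, with a cut-down illustrative document held in a string; the @titles array is a made-up helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

# cut-down illustrative version of the library document
my $xml = '<library>'
        . '<book><title>Dreamcatcher</title><price>23.99</price></book>'
        . '<book><title>Mystic River</title><price>17.49</price></book>'
        . '</library>';

my $xp  = XML::DOM::Parser->new();
my $doc = $xp->parse($xml);

# collect the text of every <title> element, wherever it occurs
my @titles;
foreach my $node ($doc->getElementsByTagName("title"))
{
   # the text is still a child text node of the element
   push @titles, $node->getFirstChild()->getData();
}

print join(", ", @titles), "\n";
```

This trades the explicit loops for a single call, at the cost of losing the grouping of titles by book.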

Here's what it looks like:

Output image

I can do the same thing with the second example as well. However, since there are quite a few levels to the document tree, I've decided to use a recursive function to iterate through the tree, rather than a series of "if" statements.

Here's the XML file,

<?xml version="1.0"?>

<recipe>

   <name>Chicken Tikka</name>
   <author>Anonymous</author>
   <date>1 June 1999</date>

   <ingredients>

      <item>
         <desc>Boneless chicken breasts</desc>
         <quantity>2</quantity>
      </item>

      <item>
         <desc>Chopped onions</desc>
         <quantity>2</quantity>
      </item>

      <item>
         <desc>Ginger</desc>
         <quantity>1 tsp</quantity>
      </item>

      <item>
         <desc>Garlic</desc>
         <quantity>1 tsp</quantity>
      </item>

      <item>
         <desc>Red chili powder</desc>
         <quantity>1 tsp</quantity>
      </item>

      <item>
         <desc>Coriander seeds</desc>
         <quantity>1 tsp</quantity>
      </item>

      <item>
         <desc>Lime juice</desc>
         <quantity>2 tbsp</quantity>
      </item>

      <item>
         <desc>Butter</desc>
         <quantity>1 tbsp</quantity>
      </item>
   </ingredients>

   <servings>
   3
   </servings>

   <process>
      <step>Cut chicken into cubes, wash and apply lime juice and salt</step>
      <step>Add ginger, garlic, chili, coriander and lime juice in a separate bowl</step>
      <step>Mix well, and add chicken to marinate for 3-4 hours</step>
      <step>Place chicken pieces on skewers and barbeque</step>
      <step>Remove, apply butter, and barbeque again until meat is tender</step>
      <step>Garnish with lemon and chopped onions</step>
   </process>

</recipe>

and here's the script which parses it.

#!/usr/bin/perl

# XML file
$file = "recipe.xml";

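# hash of tag names mapped to opening HTML markup
# "name" => large font
# "date" => italics, in parentheses
# "author" => bold
# "servings" => italics, prefixed with "Serves"
# "ingredients" => heading and unordered list
# "desc" => list item
# "quantity" => in parentheses
# "process" => heading and ordered list
# "step" => list item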
%startTags = (
"name" => "<font size=+2>",
"date" => "<i>(",
"author" => "<b>",
"servings" => "<i>Serves ",
"ingredients" => "<h3>Ingredients:</h3><ul>",
"desc" => "<li>",
"quantity" => "(",
"process" => "<h3>Preparation:</h3><ol>",
"step" => "<li>"
);

# close tags opened above
# (no entries for "desc" and "step": HTML allows </li> to be omitted)
%endTags = (
"name" => "</font><br>",
"date" => ")</i>",
"author" => "</b>",
"ingredients" => "</ul>",
"quantity" => ")",
"servings" => "</i>",
"process" => "</ol>"
);

# this function accepts an array of nodes as argument,
# iterates through it and prints HTML markup for each tag it finds.
# for each node in the array, it then gets an array of the node's children, and
# calls itself again with the array as argument (recursion)
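sub printData
{
   my (@nodeCollection) = @_;
   foreach my $node (@nodeCollection)
   {
      print $startTags{$node->getNodeName()};
      # print the leading text only if the first child is a text node (type 3);
      # an empty element, or one that starts with a child element, has none
      my $first = $node->getFirstChild();
      print $first->getData() if ($first && $first->getNodeType() == 3);
      my @children = &getChildren($node);
      printData(@children);
      print $endTags{$node->getNodeName()};
   }
}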

# this function accepts a node
# and returns all the element nodes under it (its children)
# as an array
sub getChildren
{
   my ($node) = @_;
   my @collection;

   # iterate through the children of this node
   foreach my $item ($node->getChildNodes())
   {
      # keep element nodes only (node type 1);
      # this strips out text nodes containing whitespace
      if ($item->getNodeType() == 1)
      {
         push(@collection, $item);
      }
   }

   # return node collection
   return @collection;
}

use XML::DOM;

# instantiate parser
$xp = new XML::DOM::Parser();

# parse and create tree
$doc = $xp->parsefile($file);

# send standard header to browser
print "Content-Type: text/html\n\n";

# print HTML header
print "<html><head></head><body><hr>";

# get root node
$root = $doc->getDocumentElement();

# get children
@children = &getChildren($root);

# run a recursive function starting here
&printData(@children);

print "</body></html>";

# end

In this case, I've utilized a slightly different method to mark up the XML. I've first initialized a couple of hashes to map XML tags to corresponding HTML markup, in much the same manner as I did last time. Next, I've used DOM functions to obtain a reference to the first set of child nodes in the DOM tree.

This initial array of child nodes is used to "seed" my printData() function, a recursive function which takes an array of child nodes, matches their tag names to values in the associative arrays, and outputs the corresponding HTML markup to the browser. It also obtains a reference to the next set of child nodes, via the getChildren() function, and calls itself with the new node collection as argument.

By using this recursive function, I've managed to substantially reduce the number of "if" conditional statements in my script; the code is now easier to read, and also structured more logically.

Here's what it looks like:

[Image: the recipe rendered as formatted HTML]
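The recursive traversal used by printData() is easier to see with the HTML mapping stripped away. The sketch below assumes XML::DOM is installed; the function name outline() and the shortened inline recipe are hypothetical, introduced only for illustration. Instead of printing markup, it builds an indented outline of element names, calling itself once for every child element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

# DOM node type code for element nodes
my $ELEMENT_NODE = 1;

# Return an indented outline of the element names under $node,
# one name per line, two spaces of indentation per tree level.
sub outline {
    my ($node, $depth) = @_;
    my $text = ("  " x $depth) . $node->getNodeName() . "\n";
    foreach my $child ($node->getChildNodes()) {
        # skip text nodes (data and whitespace); descend into elements only
        next unless $child->getNodeType() == $ELEMENT_NODE;
        $text .= outline($child, $depth + 1);
    }
    return $text;
}

# a shortened version of the recipe, parsed from a string
my $doc = XML::DOM::Parser->new()->parse(
    "<recipe><name>Chicken Tikka</name>" .
    "<ingredients><item><desc>Ginger</desc></item></ingredients></recipe>"
);

print outline($doc->getDocumentElement(), 0);

# release the circular references held by the DOM tree
$doc->dispose();
```

Because the function only depends on getNodeName(), getChildNodes() and getNodeType(), it works unchanged for documents of any depth, which is the main advantage of recursion over a fixed series of "if" statements.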

As you can see, you can parse a document using either DOM or SAX, and achieve the same result. The difference is that the DOM parser is a little slower, since it has to build a complete tree of the XML data in memory before your code can touch it, whereas the SAX parser is faster, since it simply calls a function each time it encounters a specific tag type and does not store the document. You should experiment with both methods to see which one works better for you.

There's another important difference between the two techniques. The SAX approach is event-centric - as the parser travels through the document, it executes specific functions depending on what it finds. Additionally, the SAX approach is sequential - tags are parsed one after the other, in the sequence in which they appear. Both these features add to the speed of the parser; however, they also limit its flexibility, since there is no tree to navigate and no quick way to jump to an arbitrary node of the document.

As opposed to this, the DOM approach builds a complete tree of the document in memory, making it possible to easily move from one node to another (in a non-sequential manner). Since the parser has the additional overhead of maintaining the tree structure in memory, speed is an issue here; however, navigation between the various "branches" of the tree is easier. Since the approach is not dependent on events, developers need to use the exposed methods and attributes of the various DOM objects to process the XML data.
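The contrast between the two approaches can be sketched side by side. This is an illustration only, assuming both XML::Parser and XML::DOM are installed; the function names steps_via_sax() and last_step_via_dom() and the small inline document are hypothetical. The SAX version has to track its own state as events arrive in document order, while the DOM version jumps straight to the node it wants once the tree exists.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
use XML::DOM;

# SAX style: handlers fire in document order, so the code must
# remember for itself whether it is currently inside a <step> tag.
sub steps_via_sax {
    my ($source) = @_;
    my (@steps, $in_step);
    my $parser = XML::Parser->new(Handlers => {
        Start => sub {
            my ($expat, $tag) = @_;
            if ($tag eq "step") { $in_step = 1; push @steps, ""; }
        },
        End => sub {
            my ($expat, $tag) = @_;
            $in_step = 0 if $tag eq "step";
        },
        Char => sub {
            my ($expat, $data) = @_;
            # character data may arrive in several pieces, so append
            $steps[-1] .= $data if $in_step;
        },
    });
    $parser->parse($source);
    return @steps;
}

# DOM style: build the whole tree first, then go directly to any node.
sub last_step_via_dom {
    my ($source) = @_;
    my $doc   = XML::DOM::Parser->new()->parse($source);
    my $steps = $doc->getElementsByTagName("step");    # a node list
    my $last  = $steps->item($steps->getLength() - 1);
    my $text  = $last->getFirstChild()->getData();
    $doc->dispose();
    return $text;
}

my $xml = "<process><step>Mix</step><step>Grill</step><step>Serve</step></process>";

print join(", ", steps_via_sax($xml)), "\n";
print last_step_via_dom($xml), "\n";
```

Getting only the last step with SAX would still mean walking through every earlier tag, whereas with DOM the cost is paid up front when the tree is built.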

That just about concludes this little tour of parsing XML data with Perl. I've tried to keep it as simple as possible, and there are numerous aspects of XML I haven't covered here. If you're interested in learning more about XML and XSL, you should visit the following links:

The XML specification, at http://www.w3.org/TR/2000/REC-xml-20001006

The XSLT specification, at http://www.w3.org/TR/xslt.html

The SAX project, at http://www.saxproject.org/

The W3C's DOM specification, at http://www.w3.org/DOM/

A number of developers have built and released Perl packages to handle XML data - if you're ever on a tight deadline, using these packages might save you some development time. Take a look at the following links for more information:

The Perl XML module list, at http://www.perlxml.com/modules/perl-xml-modules.html

CPAN, at http://www.cpan.org/

The Perl XML FAQ, at http://www.perlxml.com/faq/perl-xml-faq.html
