Consuming Xml


Consuming iterators for XML trees

(This is part of LazyKit.)

See XmlIter for non-consuming versions of these functions.

xpairs_c(tree), xnpairs_c(tree), and xmliter.switch_c(parent, ftable, [opts]) also iterate over the children of tree, but they consume the children as they process them. The following two fragments have similar semantics:

for i,x in xpairs(parent) do
  parent[i] = nil
  do_something_with(x)
end

for i,x in xpairs_c(parent) do
  do_something_with(x)
end

Using a consuming iterator asserts that you do not need to access previously processed subtrees through parent. However, you can still save the current subtree for later use:

for i,x,name in xnpairs_c(parent) do
  if name == "xref" then
    table.insert(references, x)
  end
end

The primary reason to use consuming iterators is to reduce memory usage. When using conventional XML trees, this may help a little if you are building up another data structure while tearing down the XML tree; parts of the tree you have already processed are eligible for garbage collection, saving space for your new structure.

However, when using LazyTree XML trees, memory usage can be vastly smaller. Consider processing a large log file:

<log>
  <entry>[....]</entry>
  <entry>[....]</entry>
  [...millions of elements later...]
  <entry>[....]</entry>
</log>

With a conventional XML tree, processing this requires space proportional to the size of all the <entry> elements. With normal iterators and a lazy tree, it requires space proportional to all previously processed <entry> elements (future elements are only read on demand). With consuming iterators and a lazy tree, it requires space proportional to only a single <entry> element, as previously processed <entry>s have been forgotten.
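For instance, a pass that counts the entries in such a log might look like the following minimal sketch. The lazytree.parse call is a hypothetical stand-in for however you obtain a LazyTree; only the xnpairs_c loop is the point here.

-- Hypothetical sketch: count <entry> elements while consuming a lazy tree.
-- lazytree.parse is a placeholder, not LazyKit's actual entry point.
local log = lazytree.parse("big.log.xml")

local count = 0
for i,entry,name in xnpairs_c(log) do
  if name == "entry" then
    count = count + 1
  end
  -- entry is unreachable after this iteration, so at most one
  -- <entry> subtree is live at a time
end
print(count)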

A secondary benefit is that consuming iterators may reduce CPU usage by a small amount: the Lua 5.0 garbage collector does not have to work as hard during collections when less live data is present.

What is really going on here is that iterators provide an event-based interface to tables. Consuming iterators provide many of the same benefits as pure event-based XML parsers, while allowing you to switch fluidly between event-based processing and a tree-based API when that makes sense.

One concrete example is XML-RPC <value> nodes. If a <value> contains an element such as <integer>, that element gives the node's value. Otherwise, the character data content of the <value> element is a string-typed value. Code processing the element must therefore look ahead past an arbitrary number of initial character data nodes to find out whether an element is lurking inside, falling back to string content if none is. An event-based strategy can treat the list of children of an element as an arbitrary-length lookahead buffer.
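As a sketch, a <value> decoder along those lines might read as follows, assuming LazyKit's usual tree shape (child elements are tables, character data are strings) and that xpairs yields character data children as well as elements; decode_typed is a hypothetical helper for <integer> and friends.

function decode_value(value)
  -- look ahead through the children for an element node
  for i,x in xpairs(value) do
    if type(x) == "table" then
      return decode_typed(x)  -- a child element determines the type
    end
    -- x is character data; keep scanning
  end
  -- no element found: the whole content is a string-typed value
  return table.concat(value)
end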

Usage hints

It is always safe to replace a consuming iterator with a non-consuming iterator; the only consequence may be memory exhaustion when processing huge documents.

It makes the most sense to use a consuming iterator only as the last step in processing a tree. Because of how lazy XML trees work, it is not an error to touch child nodes before calling a consuming iterator.

When recursively processing elements, you should only call a consuming iterator if you know your caller no longer cares about its contents. A rule of thumb is to only call a consuming iterator inside another consuming iterator.
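For instance, recursive processing of the log sketched above can consume each <entry>'s children precisely because the outer loop over <log> is itself consuming (do_something_with is as in the fragments above; the field names are illustrative):

for i,entry,name in xnpairs_c(log) do
  if name == "entry" then
    -- safe: the outer consuming loop has already relinquished this subtree
    for j,field,fname in xnpairs_c(entry) do
      do_something_with(fname, field)
    end
  end
end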

