There are three related methods for iterating through an XML tree and finding
nodes in it: ``walk``, ``walknodes`` and ``walkpaths``.

===================
The ``walk`` method
===================

The method ``walk`` is a generator. When called without any arguments it visits
each node in the tree once, yields parent nodes before their children, and
yields no attribute nodes. (This can be changed by passing certain arguments to
``walk``.)

``walk`` yields a ``Cursor`` object. In fact ``walk`` always yields the same
cursor object, but its attributes are updated during the traversal. A
``Cursor`` object has the following attributes:

``root``
    The node where the traversal was started (i.e. the object for which the
    ``walk`` method has been called).

``node``
    The current node being traversed.

``path``
    A list of nodes that contains the path through the tree from the root to
    the current node (i.e. ``path[0]`` is ``root`` and ``path[-1]`` is
    ``node``).

``index``
    A path of indices (e.g. ``[0, 1]`` if the current node is the second child
    of the first child of the root). Inside attributes the index path will
    contain the name of the attribute (or an (attribute name, namespace name)
    tuple inside a global attribute).

``event``
    A string that specifies which event is currently being handled. Possible
    values are: ``"enterelementnode"``, ``"leaveelementnode"``,
    ``"enterattrnode"``, ``"leaveattrnode"``, ``"textnode"``,
    ``"commentnode"``, ``"doctypenode"``, ``"procinstnode"``, ``"entitynode"``
    and ``"nullnode"``.

The following example shows the basic usage of the ``walk`` method::

    >>> from ll.xist.ns import html
    >>> e = html.ul(html.li(i) for i in range(3))
    >>> for cursor in e.walk():
    ...     print("{0.event} {0.node!r}".format(cursor))
    ...
    enterelementnode
    enterelementnode
    textnode
    enterelementnode
    textnode
    enterelementnode
    textnode

Using the ``walk`` method

The ``path`` attribute can be used like this::

    >>> from ll.xist.ns import html
    >>> e = html.ul(html.li(i) for i in range(3))
    >>> for cursor in e.walk():
    ...     print(["{0.__module__}.{0.__qualname__}".format(n.__class__) for n in cursor.path])
    ...
    ['ll.xist.ns.html.ul']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']

Using the ``path`` attribute

The following example shows how the ``index`` attribute works::

    >>> from ll.xist.ns import html
    >>> e = html.ul(html.li(i) for i in range(3))
    >>> for cursor in e.walk():
    ...     print("{0.index} {0.node!r}".format(cursor))
    ...
    []
    [0]
    [0, 0]
    [1]
    [1, 0]
    [2]
    [2, 0]

Using the ``index`` attribute

==============================================
Changing which parts of the tree are traversed
==============================================

The ``walk`` method has a few additional parameters that specify which part of
the tree should be traversed and in which order:

``entercontent`` (default ``True``)
    Should the content of an element be entered? Note that when you call
    ``walk`` with ``entercontent`` being false, ``walk`` will only yield the
    root node itself.

``enterattrs`` (default ``False``)
    Should the attributes of an element be entered? The following example
    shows the usage of ``enterattrs``::

        >>> from ll.xist.ns import html
        >>> e = html.ul(html.li(i, class_="li-{}".format(i)) for i in range(3))
        >>> for cursor in e.walk(enterattrs=True):
        ...     print("{}{!r}".format("\t"*(len(cursor.path)-1), cursor.node))
        ...

    Using the ``enterattrs`` parameter

    When both ``entercontent`` and ``enterattrs`` are true, the attributes will
    always be entered before the content. Setting ``enterattrs`` to true will
    only visit the attribute nodes themselves, but not their content.

``enterattr`` (default ``False``)
    Should the content of the attributes of an element be entered? (This is
    only relevant if ``enterattrs`` is true.) The following example shows the
    usage of the ``enterattr`` parameter::

        >>> from ll.xist.ns import html
        >>> e = html.ul(html.li(i, class_="li-{}".format(i)) for i in range(3))
        >>> for cursor in e.walk(enterattrs=True, enterattr=True):
        ...     print("{}{!r}".format("\t"*(len(cursor.path)-1), cursor.node))
        ...

    Using the ``enterattr`` parameter

========================
Changing traversal order
========================

The default traversal order is "top down". The following ``walk`` parameters
can be used to change that into "bottom up" order or into visiting each element
or attribute both on the way down and up:

``enterelementnode`` (default ``True``)
    Should the generator yield the cursor before it enters an element (i.e.
    before it visits the attributes and content of the element)? The cursor
    attribute ``event`` will have the value ``"enterelementnode"`` in this
    case.

``leaveelementnode`` (default ``False``)
    Should the generator yield the cursor after it has visited an element? The
    cursor attribute ``event`` will have the value ``"leaveelementnode"`` in
    this case.

    Passing ``enterelementnode=False, leaveelementnode=True`` to ``walk`` will
    change "top down" traversal into "bottom up".

``enterattrnode`` (default ``True``)
    Should the generator yield the cursor before it enters an attribute? The
    cursor attribute ``event`` will have the value ``"enterattrnode"`` in this
    case.
    Note that the attribute will only be entered when ``enterattr`` is true and
    it will only be visited if ``enterattrs`` is true.

``leaveattrnode`` (default ``False``)
    Should the generator yield the cursor after it has visited an attribute?
    The cursor attribute ``event`` will have the value ``"leaveattrnode"`` in
    this case. Note that the attribute will only be entered when ``enterattr``
    is true and it will only be visited if ``enterattrs`` is true.

Passing ``True`` for all these parameters gives us the following output::

    >>> from ll.xist.ns import html
    >>> e = html.ul(html.li(i, class_="li-{}".format(i)) for i in range(3))
    >>> for cursor in e.walk(entercontent=True, enterattrs=True, enterattr=True,
    ...         enterelementnode=True, leaveelementnode=True,
    ...         enterattrnode=True, leaveattrnode=True):
    ...     print("{0}{1.event} {1.index} {1.node!r}".format("\t"*(len(cursor.path)-1), cursor))
    ...
    enterelementnode []
        enterelementnode [0]
            enterattrnode [0, 'class']
                textnode [0, 'class', 0]
            leaveattrnode [0, 'class']
            textnode [0, 0]
        leaveelementnode [0]
        enterelementnode [1]
            enterattrnode [1, 'class']
                textnode [1, 'class', 0]
            leaveattrnode [1, 'class']
            textnode [1, 0]
        leaveelementnode [1]
        enterelementnode [2]
            enterattrnode [2, 'class']
                textnode [2, 'class', 0]
            leaveattrnode [2, 'class']
            textnode [2, 0]
        leaveelementnode [2]
    leaveelementnode []

Full tree traversal

==========================
Skipping parts of the tree
==========================

It is possible to change the cursor attributes that specify the traversal order
during the traversal to skip certain parts of the tree.
In the following example the content of ``html.li`` elements is skipped if they
have a ``class`` attribute::

    >>> from ll.xist.ns import html
    >>> e = html.ul(html.li(i, class_=None if i%2 else "li-{}".format(i)) for i in range(3))
    >>> for cursor in e.walk():
    ...     if isinstance(cursor.node, html.li) and "class_" in cursor.node.attrs:
    ...         cursor.entercontent = False
    ...     print("{0}{1.event} {1.node!r}".format("\t"*(len(cursor.path)-1), cursor))
    ...
    enterelementnode
        enterelementnode
        enterelementnode
            textnode
        enterelementnode

Skipping parts of the tree

This works for the following attributes:

* ``entercontent``
* ``enterattrs``
* ``enterattr``
* ``enterelementnode``
* ``leaveelementnode``
* ``enterattrnode``
* ``leaveattrnode``

After the ``walk`` generator has been reentered and the modified attribute has
been taken into account, all those attributes will be reset to their initial
value (i.e. the value that has been passed to ``walk``).

===========================================
The methods ``walknodes`` and ``walkpaths``
===========================================

In addition to ``walk`` two other methods are available: ``walknodes`` and
``walkpaths``. These generators don't produce a cursor object like ``walk``
does.

``walknodes`` produces the node itself, as the following example demonstrates::

    >>> from ll.xist.ns import html
    >>> e = html.ul(html.li(i) for i in range(3))
    >>> for node in e.walknodes():
    ...     print(repr(node))
    ...

Using ``walknodes``

``walkpaths`` produces the path. This is a copy of the path, so it won't be
changed once ``walkpaths`` is reentered::

    >>> from ll.xist.ns import html
    >>> e = html.ul(html.li(i) for i in range(3))
    >>> for path in e.walkpaths():
    ...     print(["{0.__module__}.{0.__qualname__}".format(n.__class__) for n in path])
    ...
    ['ll.xist.ns.html.ul']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
    ['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']

Using ``walkpaths``

==========================================
Filtering the output of the tree traversal
==========================================

All three tree traversal methods provide an additional argument
(``*selectors``) that can be used to filter which nodes/paths are produced.
This argument can be specified multiple times (which also means that all other
arguments must be passed as keyword arguments).

Passing a node class
--------------------

In the simplest case you can pass a ``Node`` subclass to get only instances of
that class. The following example prints all the links on the Python home
page::

    from ll.xist import xsc, parse
    from ll.xist.ns import xml, html, chars

    doc = parse.tree(
        parse.URL("http://www.python.org"),
        parse.Expat(ns=True),
        parse.Node(pool=xsc.Pool(xml, html, chars))
    )

    for node in doc.walknodes(html.a):
        print(node.attrs.href)

Finding all links on the Python home page

This gives the output::

    http://www.python.org/
    http://www.python.org/#left%2Dhand%2Dnavigation
    http://www.python.org/#content%2Dbody
    http://www.python.org/search
    http://www.python.org/about/
    http://www.python.org/news/
    http://www.python.org/doc/
    http://www.python.org/download/
    http://www.python.org/getit/
    http://www.python.org/community/
    ...

Passing multiple selector arguments
-----------------------------------

You can also pass multiple classes to search for nodes that are an instance of
any of the classes. The following example prints all header elements on the
Python home page::

    from ll.xist import xsc, parse
    from ll.xist.ns import xml, html, chars

    doc = parse.tree(
        parse.URL("http://www.python.org"),
        parse.Expat(ns=True),
        parse.Node(pool=xsc.Pool(xml, html, chars))
    )

    for node in doc.walknodes(html.h1, html.h2, html.h3, html.h4, html.h5, html.h6):
        print(node.string())

Finding all headers on the Python home page

This will output:

Help

Package Index

Quick Links (2.7.3)

Quick Links (3.3.0)

Python Jobs

Python Merchandise

Python Wiki

Python Insider Blog

Python 2 or 3?

Help Fund Python

Non-English Resources

Python Programming Language – Official Website

Support the Python Community

Python 3 Poll

NASA uses Python...

What they are saying...

Using Python For...

Python 3.3.0 released

Third rc for Python 3.3.0 released

Python Software Foundation announces Distinguished Service Award

ConFoo conference in Canada, February 25th - March 13th

Second rc for Python 3.3.0 released

First rc for Python 3.3.0 released

Fifth annual pyArkansas conference to be held

Passing a callable
------------------

It is also possible to pass a function to ``walk``. This function will be
called for each visited node and gets passed the path to the visited node. If
the function returns true, the node will be output. The following example finds
all external links on the Python home page::

    from ll.xist import xsc, parse
    from ll.xist.ns import xml, html, chars

    doc = parse.tree(
        parse.URL("http://www.python.org"),
        parse.Expat(ns=True),
        parse.Node(pool=xsc.Pool(xml, html, chars))
    )

    def isextlink(path):
        return isinstance(path[-1], html.a) and not str(path[-1].attrs.href).startswith("http://www.python.org")

    for node in doc.walknodes(isextlink):
        print(node.attrs.href)

Finding external links on the Python home page

This gives the output::

    http://docs.python.org/devguide/
    http://pypi.python.org/pypi
    http://docs.python.org/2/
    http://docs.python.org/3/
    http://wiki.python.org/moin/
    http://blog.python.org/
    http://wiki.python.org/moin/Python2orPython3
    http://wiki.python.org/moin/Languages
    http://wiki.python.org/moin/Languages
    ...

``xfind`` selectors
-------------------

The selector arguments for the walk methods get converted into a so-called
xfind selector. xfind selectors look somewhat like XPath expressions, but are
implemented as pure Python expressions (overloading various Python operators).

Every subclass of ``ll.xist.xsc.Node`` can be used as an xfind selector and
combined with other xfind selectors to create more complex ones.

For example, searching for links that contain images works as follows::

    for path in doc.walkpaths(html.a/html.img):
        print(path[-2].attrs.href, path[-1].attrs.src)

Searching for ``img`` inside ``a`` with an xfind expression

The output looks like this::

    http://www.python.org/ http://www.python.org/images/python-logo.gif
    http://www.python.org/#left%2Dhand%2Dnavigation http://www.python.org/images/trans.gif
    http://www.python.org/#content%2Dbody http://www.python.org/images/trans.gif
    http://www.python.org/psf/donations/ http://www.python.org/images/donate.png
    http://wiki.python.org/moin/Languages http://www.python.org/images/worldmap.jpg
    http://www.python.org/about/success/usa/ http://www.python.org/images/success/nasa.jpg

If the ``img`` elements are not immediate children of the ``a`` elements, the
xfind selector above won't output them. In this case you can use a “descendant
selector” instead of a “child selector”. To do this simply replace
``html.a/html.img`` with ``html.a//html.img``.

Apart from the ``/`` and ``//`` operators you can also use the ``|`` and ``&``
operators to combine xfind selectors::

    from ll.xist import xsc, parse, xfind
    from ll.xist.ns import xml, html, chars

    doc = parse.tree(
        parse.URL("http://www.python.org"),
        parse.Expat(ns=True),
        parse.Node(pool=xsc.Pool(xml, html, chars))
    )

    for node in doc.walknodes((html.a | html.area) & xfind.hasattr("href")):
        print(node.attrs.href)

Here's another example that finds all elements that have an ``id`` attribute::

    from ll.xist import xsc, parse, xfind
    from ll.xist.ns import xml, html, chars

    doc = parse.tree(
        parse.URL("http://www.python.org"),
        parse.Expat(ns=True),
        parse.Node(pool=xsc.Pool(xml, html, chars))
    )

    for node in doc.walknodes(xfind.hasattr("id")):
        print(node.attrs.id)

The output looks like this::

    screen-switcher-stylesheet
    logoheader
    logolink
    logo
    skiptonav
    skiptocontent
    utility-menu
    searchbox
    searchform
    ...

For more examples refer to the documentation of the ``xfind`` module.

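
How operator overloading can turn plain Python expressions into path selectors
can be sketched in a few lines. This is a toy model under stated assumptions,
not the ``xfind`` implementation: the classes ``Selector``, ``IsTag``,
``Child``, ``Descendant``, ``Or`` and ``And`` are hypothetical, and a path is
simplified to a list of tag name strings instead of a list of nodes.

```python
# Toy model only: none of these classes exist in ll.xist.xfind.

class Selector:
    """Base class; match(path) decides whether the last item of path is selected."""

    def match(self, path):
        raise NotImplementedError

    def __truediv__(self, other):   # self / other: child selector
        return Child(self, other)

    def __floordiv__(self, other):  # self // other: descendant selector
        return Descendant(self, other)

    def __or__(self, other):        # self | other: either one matches
        return Or(self, other)

    def __and__(self, other):       # self & other: both match
        return And(self, other)


class IsTag(Selector):
    def __init__(self, name):
        self.name = name

    def match(self, path):
        return bool(path) and path[-1] == self.name


class Binary(Selector):
    def __init__(self, left, right):
        self.left = left
        self.right = right


class Child(Binary):
    def match(self, path):
        # The node itself must match right, the path to its parent must match left.
        return self.right.match(path) and self.left.match(path[:-1])


class Descendant(Binary):
    def match(self, path):
        # The node must match right, the path to some ancestor must match left.
        return self.right.match(path) and any(
            self.left.match(path[:i]) for i in range(1, len(path))
        )


class Or(Binary):
    def match(self, path):
        return self.left.match(path) or self.right.match(path)


class And(Binary):
    def match(self, path):
        return self.left.match(path) and self.right.match(path)


a, img, p, body = IsTag("a"), IsTag("img"), IsTag("p"), IsTag("body")
path = ["html", "body", "a", "span", "img"]

child_result = (a / img).match(path)        # img is not a direct child of a
descendant_result = (a // img).match(path)  # but it is a descendant of a
or_result = ((a | p) // img).match(path)
and_result = ((a // img) & (body // img)).match(path)
print(child_result, descendant_result, or_result, and_result)
```

Note that Python's operator precedence still applies to such expressions:
``/`` and ``//`` bind tighter than ``&``, which binds tighter than ``|``, so
combinations generally need parentheses, as in the ``(html.a | html.area) &
xfind.hasattr("href")`` example above.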
CSS selectors
-------------

It's also possible to use CSS selectors as selectors for the ``walk`` method.
The module ``ll.xist.css`` provides a function ``selector`` that turns a CSS
selector expression into an xfind selector::

    from ll.xist import xsc, parse, css
    from ll.xist.ns import xml, html, chars

    doc = parse.tree(
        parse.URL("http://www.python.org"),
        parse.Expat(ns=True),
        parse.Node(pool=xsc.Pool(xml, html, chars))
    )

    for cursor in doc.walk(css.selector("div#menu ul.level-one li > a")):
        print(cursor.node.attrs.href)

Using CSS selectors as xfind selector

This outputs all the first level links in the navigation::

    http://www.python.org/about/
    http://www.python.org/news/
    http://www.python.org/doc/
    http://www.python.org/download/
    http://www.python.org/getit/
    http://www.python.org/community/
    http://www.python.org/psf/
    http://docs.python.org/devguide/

Most of the CSS 3 selectors are supported. For more examples see the
documentation of the ``css`` module.
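
To give a rough idea of how a CSS selector string relates to child and
descendant matching on a path, here is a minimal sketch. It is not how
``ll.xist.css`` works internally: the functions ``parse_steps`` and
``matches`` are hypothetical, only type selectors with the ``>`` and
whitespace combinators are handled (no ids, classes or attributes), and a path
is simplified to a list of tag name strings.

```python
# Toy model only: parse_steps() and matches() are hypothetical helpers.

def parse_steps(css):
    """Turn 'ul li > a' into [(None, 'ul'), (' ', 'li'), ('>', 'a')].

    Each step stores the combinator that links it to the previous step.
    """
    steps = []
    combinator = None
    for token in css.replace(">", " > ").split():
        if token == ">":
            combinator = ">"
        else:
            steps.append((combinator, token))
            combinator = " "  # default for the next step: descendant
    return steps


def matches(steps, path):
    """Return True if the last item of path is selected by steps."""
    if not steps or not path:
        return False
    combinator, tag = steps[-1]
    if path[-1] != tag:
        return False
    if len(steps) == 1:
        return True
    rest, ancestors = steps[:-1], path[:-1]
    if combinator == ">":
        # Child combinator: the direct parent must satisfy the remaining steps.
        return matches(rest, ancestors)
    # Descendant combinator: any ancestor may satisfy the remaining steps.
    return any(matches(rest, ancestors[:i]) for i in range(1, len(ancestors) + 1))


steps = parse_steps("ul li > a")
direct = matches(steps, ["html", "body", "ul", "li", "a"])
nested = matches(steps, ["html", "body", "ul", "li", "span", "a"])
print(steps)
print(direct, nested)
```

In this model the ``>`` combinator plays the role of the xfind ``/`` operator
and whitespace plays the role of ``//``.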