Parsing XML: Does it have to be so annoying?
Some time ago I was given the task of implementing a handler for an XML API, and to be honest, I thought: “Oh no! Not XML, not again!” I’m aware that I’m not an expert at XML, and that may be why I’m not a big fan of it. So I decided to do something about that.
The most irritating part of working with XML is extracting data from it. So far in Python I had been using the built-in xml module, lxml, or BeautifulSoup. With all of them the process looks the same: navigating over nodes, checking children, getting text values, and so on, which always results in many lines of code.
Then I started digging and stumbled on something named XPath.
The official site says that “XPath is used to navigate through elements and attributes in an XML document. XPath is a major element in W3C’s XSLT standard.” I would add that XPath is a sort of CSS selector for XML. What is great about XPath is that many popular languages have implementations of it, so you can use it in Python, C/C++, Java, C#, Ruby, and many others.
To show how useful XPath is, I will present a very simple example in Python. Let’s try to parse the Tivix team page and print all the team members’ names and positions. First, we need to fetch the page’s content:
>>> import urllib
>>> url = 'https://www.tivix.com/team/'
>>> page_content = urllib.urlopen(url).read()
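(This is Python 2 code; on Python 3, urlopen lives in urllib.request, so a roughly equivalent sketch would be:)

>>> from urllib.request import urlopen
>>> page_content = urlopen('https://www.tivix.com/team/').read()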
After a quick look at the page source, you will see this structure for the team members:
<body>
  ...
  <div class='team-member'>
    ...
    <h4> ...name... </h4>
    <h5> ...position... </h5>
  </div>
  <div class='team-member'>
    ...
    <h4> ...name... </h4>
    <h5> ...position... </h5>
  </div>
  ...
</body>
Based on this, we can write XPath expressions that return two lists: the first with the members’ names, the second with their positions. First, let’s parse the page with lxml:
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> xml_root = etree.XML(page_content, parser=parser)
xml_root now holds the HTML document as a tree structure. We can call the xpath() method on this object:
>>> member_list = xml_root.xpath("//div[@class='team-member']//h4/text()")
>>> position_list = xml_root.xpath("//div[@class='team-member']//h5/text()")
>>> print zip(member_list, position_list)
[('Bret Waters', 'CEO'), ('Sumit Chachra', 'CTO'), ('Francis Cleary', 'Technical Architect'), ('Dariusz Fryta', 'Technical Architect'), …]
Isn’t that simpler?
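Relying on two separate lists does assume that every team-member div contains both an h4 and an h5. If that weren’t guaranteed, one possible variation (just a sketch) is to select each team-member div and run relative XPath queries against it, so a missing tag can’t shift the pairing:

>>> members = xml_root.xpath("//div[@class='team-member']")
>>> for member in members:
...     name = member.xpath(".//h4/text()")
...     position = member.xpath(".//h5/text()")
...     if name and position:
...         print name[0], '-', position[0]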
What I like most about this solution is that when the document has no elements matching our XPath expression, xpath() simply returns an empty list. We don’t have to worry about catching exceptions, checking whether node A has node B, or other conditions.
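For example, an expression that matches nothing (here a made-up class name) just gives back an empty list:

>>> xml_root.xpath("//div[@class='no-such-class']//h4/text()")
[]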
For more details about XPath syntax, check this site.