Gentle Introduction to Reading and Writing XML using Python

October 18, 2010 by hs

There are many ways to interact with XML using Python. Here I will provide a simple introduction to reading and writing XML using lxml.

Create (Write) XML

Here I will try to create a sample XML similar to how FreeSWITCH creates its extensions/users.



from lxml import etree

root = etree.Element("include")

comment1 = etree.Comment("This is a comment")

root.append(comment1)



First we create the root element, which is the include tag in this case. Then we add a comment to it.



user = etree.SubElement(root, "user")

user.set('id', '1000')



We create a user tag, which is a sub-element of the include tag. Using the set method, we have created a single attribute. The name of this attribute is id and its value is 1000.
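As a small aside, attributes can also be supplied as keyword arguments when the element is created, and read back afterwards. The sketch below is only an illustration using the same tag and attribute names as above; it is not how FreeSWITCH itself generates these files.

```python
from lxml import etree

root = etree.Element("include")

# Attributes may be passed as keyword arguments at creation time
# instead of calling set() afterwards.
user = etree.SubElement(root, "user", id="1000")

user_id = user.get("id")        # read a single attribute
attributes = dict(user.attrib)  # copy all attributes into a plain dict

print(user_id)
print(attributes)
```

Using set() as in the article and passing keyword arguments should produce the same element; which one reads better is mostly a matter of taste.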



params = etree.SubElement(user, "params")



Here we created params, a sub-element of the user tag. It is a tag as well, but it does not have any attributes.



param = etree.SubElement(params, "param")

param.set('name', 'password')

param.set('value', '$${default_password}')



We create a sub-element of params called param. It has two attributes and their names are name and value. Their values are password and $${default_password} respectively.



param = etree.SubElement(params, "param")

param.set('name', 'vm-password')

param.set('value', '1000')



We create another sub-element of params with different attributes. This is to demonstrate that we can create as many sub-elements of a tag (element or sub-element) as required.

variables = etree.SubElement(user, "variables")

Here we created another sub-element, variables, of the user tag/element, similar to params.



variable = etree.SubElement(variables, "variable")

variable.set('name', 'toll_allow')

variable.set('value', 'domestic,international,local')

variable = etree.SubElement(variables, "variable")

variable.set('name', 'accountcode')

variable.set('value', '1000')

variable = etree.SubElement(variables, "variable")

variable.set('name', 'user_context')

variable.set('value', 'default')

variable = etree.SubElement(variables, "variable")

variable.set('name', 'effective_caller_id_name')

variable.set('value', 'Extension 1000')

variable = etree.SubElement(variables, "variable")

variable.set('name', 'effective_caller_id_number')

variable.set('value', '1000')

variable = etree.SubElement(variables, "variable")

variable.set('name', 'outbound_caller_id_name')

variable.set('value', '$${outbound_caller_name}')

variable = etree.SubElement(variables, "variable")

variable.set('name', 'outbound_caller_id_number')

variable.set('value', '$${outbound_caller_id}')

variable = etree.SubElement(variables, "variable")

variable.set('name', 'callgroup')

variable.set('value', 'techsupport')

variable.text = 'This can contain data'



The above code creates several sub-elements of the variables element/tag, each called variable. Notice that we set some text through .text on the last one. All other variable tags do not have any “data” while the last one does. This is where I have moved away from the FreeSWITCH file, because there a variable contains only attributes and no “data”.
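Since every variable follows the same pattern, the repetition above could also be written as a loop. This is a minimal sketch under the assumption that the names and values are available as a list of pairs; the list name settings is made up for this illustration and only three of the pairs are shown.

```python
from lxml import etree

variables = etree.Element("variables")

# (name, value) pairs taken from the example above; "settings" is a
# hypothetical name used only for this sketch.
settings = [
    ("toll_allow", "domestic,international,local"),
    ("accountcode", "1000"),
    ("user_context", "default"),
]

for name, value in settings:
    variable = etree.SubElement(variables, "variable")
    variable.set("name", name)
    variable.set("value", value)

# Iterating over an element yields its children in document order.
names = [child.get("name") for child in variables]
print(names)
```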



root_tree = etree.ElementTree(root)

print(etree.tostring(root_tree, pretty_print=True).decode("utf-8"))



Above we wrap the initial, root tag (include in this case) in an ElementTree object, which represents the whole document. All the tags we defined above are part of this tree structure. At the end we simply serialize the complete tree and print it. The output should be similar to the one below.

<include>
  <!--This is a comment-->
  <user id="1000">
    <params>
      <param name="password" value="$${default_password}"/>
      <param name="vm-password" value="1000"/>
    </params>
    <variables>
      <variable name="toll_allow" value="domestic,international,local"/>
      <variable name="accountcode" value="1000"/>
      <variable name="user_context" value="default"/>
      <variable name="effective_caller_id_name" value="Extension 1000"/>
      <variable name="effective_caller_id_number" value="1000"/>
      <variable name="outbound_caller_id_name" value="$${outbound_caller_name}"/>
      <variable name="outbound_caller_id_number" value="$${outbound_caller_id}"/>
      <variable name="callgroup" value="techsupport">This can contain data</variable>
    </variables>
  </user>
</include>
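The reading section further down works on a file, so the tree has to be saved somewhere first. One way to do that, sketched here with a much smaller tree, is the write method of the ElementTree object. The temporary directory is an assumption made so the sketch has no side effects; in this post the file is simply called 1000.xml.

```python
import os
import tempfile

from lxml import etree

root = etree.Element("include")
etree.SubElement(root, "user", id="1000")
tree = etree.ElementTree(root)

# Write into a temporary directory (an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), "1000.xml")
tree.write(path, pretty_print=True, xml_declaration=True, encoding="UTF-8")

# Read the file back as bytes to check what was written.
with open(path, "rb") as handle:
    written = handle.read()

print(written.decode("utf-8"))
```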

Parse (Read) XML

Reading XML is very similar to writing it.



from lxml import etree

infile = open("1000.xml", 'rb')



In the above code we open the XML file we created above (which we stored in a file called 1000.xml in this case) for reading. We open it as read+binary, rb, because on Python 3 iterparse expects the file object to return bytes; binary mode works on Python 2 as well.

context = etree.iterparse(infile, events=("start", "end"))

It’s a good idea to read an XML file iteratively, so that when reading large files we do not have to keep everything in memory at once. Note that iterparse still builds the tree as it goes, so memory is only saved if you discard elements once you are done with them (for example with element.clear()). We have created an iterator which will read the file, infile. Since iterparse uses “events”, we are using two main events, namely start and end. “Start” occurs when the opening tag of an element is read and “end” occurs when the tag is closed. The text and children of an element are only guaranteed to be complete at the “end” event.
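To make the memory point concrete, here is a minimal sketch of the usual pattern: listen for end events only and clear each element after it has been processed. It reads from an in-memory BytesIO buffer instead of 1000.xml so that it is self-contained; the sample document is a cut-down version of the one created earlier.

```python
from io import BytesIO

from lxml import etree

xml_bytes = (
    b"<include><user id='1000'><params>"
    b"<param name='password' value='$${default_password}'/>"
    b"<param name='vm-password' value='1000'/>"
    b"</params></user></include>"
)

seen = []
# Only "end" events: an element is complete once its closing tag is read.
for event, element in etree.iterparse(BytesIO(xml_bytes), events=("end",)):
    if element.tag == "param":
        seen.append(element.get("name"))
        # Drop attributes, text and children we no longer need.
        element.clear()

print(seen)
```

For files that fit comfortably in memory the clear() call makes little difference; it starts to matter when the file has very many elements.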

for event, element in context:
    print('Event: %s' % event)
    print('Element Tag: %s' % element.tag)
    print('Element Text: %s' % element.text)
    print('Element Items %s' % element.items())
    print('Previous Element %s' % element.getprevious())
    print('Parent Element %s' % element.getparent())

In the above code we iterate over the XML file. The context iterator yields two things on every pass: the event (start or end in our case) and the element (or tag) that was read. The “element” object has some attributes and methods which we have used here:

tag contains the tag (include, user, params, variables, etc. in our example)

text contains any “data” the element might contain. In our case, the last variable contains data

items() returns a list of (name, value) pairs, one for each attribute. For example, each param has two attributes, called name and value, so items() returns two pairs for it

getprevious() returns the preceding sibling, that is, the element just before this one under the same parent, or None if there is none

Each element (or tag) in XML, except the root, has exactly one parent and getparent() returns that tag (or element). For the root it returns None
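The navigation methods listed above can be tried on a tiny document without any file at all. The sketch below parses a two-element params fragment from a string, which is an assumption made purely to keep the example short.

```python
from lxml import etree

root = etree.fromstring(
    "<params>"
    "<param name='password'/>"
    "<param name='vm-password'/>"
    "</params>"
)

# Unpacking an element iterates over its children.
first, second = root

print(first.getprevious())                # no earlier sibling
print(second.getprevious().get("name"))   # name of the sibling before it
print(second.getparent().tag)             # tag of the enclosing element
print(root.getparent())                   # the root has no parent
print(second.items())                     # attributes as (name, value) pairs
```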

infile.close()

Finally, we close the input file. I will add one more thing: if you are only interested in a particular tag (or element), you can pass it to iterparse like so: context = etree.iterparse(infile, events=("start", "end"), tag="param"). Events are then reported only for elements with that tag.
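As a quick check of the tag argument, the following sketch runs iterparse over a small in-memory document and collects what comes back. The BytesIO buffer and the shortened document are assumptions made to keep it self-contained.

```python
from io import BytesIO

from lxml import etree

xml_bytes = (
    b"<include><user id='1000'><params>"
    b"<param name='password'/>"
    b"<param name='vm-password'/>"
    b"</params></user></include>"
)

# With tag="param", events for include, user and params are skipped.
context = etree.iterparse(BytesIO(xml_bytes), events=("end",), tag="param")
found = [(event, element.tag, element.get("name")) for event, element in context]

print(found)
```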

Running the loop shown earlier (without the tag filter) on the 1000.xml input file gives output similar to the one provided below.

Event: start

Element Tag: include

Element Text:

Element Items []

Previous Element None

Parent Element None

Event: start

Element Tag: user

Element Text:

Element Items [('id', '1000')]

Previous Element <!--This is a comment-->

Parent Element <Element include at b7737784>

Event: start

Element Tag: params

Element Text:

Element Items []

Previous Element None

Parent Element <Element user at b77377ac>

Event: start

Element Tag: param

Element Text: None

Element Items [('name', 'password'), ('value', '$${default_password}')]

Previous Element None

Parent Element <Element params at b77377d4>

Event: end

Element Tag: param

Element Text: None

Element Items [('name', 'password'), ('value', '$${default_password}')]

Previous Element None

Parent Element <Element params at b77377d4>

Event: start

Element Tag: param

Element Text: None

Element Items [('name', 'vm-password'), ('value', '1000')]

Previous Element <Element param at b77377fc>

Parent Element <Element params at b77377d4>

Event: end

Element Tag: param

Element Text: None

Element Items [('name', 'vm-password'), ('value', '1000')]

Previous Element <Element param at b77377fc>

Parent Element <Element params at b77377d4>

Event: end

Element Tag: params

Element Text:

Element Items []

Previous Element None

Parent Element <Element user at b77377ac>

Event: start

Element Tag: variables

Element Text:

Element Items []

Previous Element <Element params at b77377d4>

Parent Element <Element user at b77377ac>

Event: start

Element Tag: variable

Element Text: None

Element Items [('name', 'toll_allow'), ('value', 'domestic,international,local')]

Previous Element None

Parent Element <Element variables at b773784c>

Event: end

Element Tag: variable

Element Text: None

Element Items [('name', 'toll_allow'), ('value', 'domestic,international,local')]

Previous Element None

Parent Element <Element variables at b773784c>

Event: start

Element Tag: variable

Element Text: None

Element Items [('name', 'accountcode'), ('value', '1000')]

Previous Element <Element variable at b7737874>

Parent Element <Element variables at b773784c>

Event: end

Element Tag: variable

Element Text: None

Element Items [('name', 'accountcode'), ('value', '1000')]

Previous Element <Element variable at b7737874>

Parent Element <Element variables at b773784c>

Event: start

Element Tag: variable

Element Text: None

Element Items [('name', 'user_context'), ('value', 'default')]

Previous Element <Element variable at b773789c>

Parent Element <Element variables at b773784c>

Event: end

Element Tag: variable

Element Text: None

Element Items [('name', 'user_context'), ('value', 'default')]

Previous Element <Element variable at b773789c>

Parent Element <Element variables at b773784c>

Event: start

Element Tag: variable

Element Text: None

Element Items [('name', 'effective_caller_id_name'), ('value', 'Extension 1000')]

Previous Element <Element variable at b77378c4>

Parent Element <Element variables at b773784c>

Event: end

Element Tag: variable

Element Text: None

Element Items [('name', 'effective_caller_id_name'), ('value', 'Extension 1000')]

Previous Element <Element variable at b77378c4>

Parent Element <Element variables at b773784c>

Event: start

Element Tag: variable

Element Text: None

Element Items [('name', 'effective_caller_id_number'), ('value', '1000')]

Previous Element <Element variable at b77378ec>

Parent Element <Element variables at b773784c>

Event: end

Element Tag: variable

Element Text: None

Element Items [('name', 'effective_caller_id_number'), ('value', '1000')]

Previous Element <Element variable at b77378ec>

Parent Element <Element variables at b773784c>

Event: start

Element Tag: variable

Element Text: None

Element Items [('name', 'outbound_caller_id_name'), ('value', '$${outbound_caller_name}')]

Previous Element <Element variable at b7737914>

Parent Element <Element variables at b773784c>

Event: end

Element Tag: variable

Element Text: None

Element Items [('name', 'outbound_caller_id_name'), ('value', '$${outbound_caller_name}')]

Previous Element <Element variable at b7737914>

Parent Element <Element variables at b773784c>

Event: start

Element Tag: variable

Element Text: None

Element Items [('name', 'outbound_caller_id_number'), ('value', '$${outbound_caller_id}')]

Previous Element <Element variable at b773793c>

Parent Element <Element variables at b773784c>

Event: end

Element Tag: variable

Element Text: None

Element Items [('name', 'outbound_caller_id_number'), ('value', '$${outbound_caller_id}')]

Previous Element <Element variable at b773793c>

Parent Element <Element variables at b773784c>

Event: start

Element Tag: variable

Element Text: This can contain data

Element Items [('name', 'callgroup'), ('value', 'techsupport')]

Previous Element <Element variable at b7737964>

Parent Element <Element variables at b773784c>

Event: end

Element Tag: variable

Element Text: This can contain data

Element Items [('name', 'callgroup'), ('value', 'techsupport')]

Previous Element <Element variable at b7737964>

Parent Element <Element variables at b773784c>

Event: end

Element Tag: variables

Element Text:

Element Items []

Previous Element <Element params at b77377d4>

Parent Element <Element user at b77377ac>

Event: end

Element Tag: user

Element Text:

Element Items [('id', '1000')]

Previous Element <!--This is a comment-->

Parent Element <Element include at b7737784>

Event: end

Element Tag: include

Element Text:

Element Items []

Previous Element None

Parent Element None

Hat Tips

I strongly recommend that you read up on XML if you are not familiar with it. I could not have written this post without the help of: Parsing XML and HTML with lxml; High-performance XML parsing in Python with lxml; The lxml.etree Tutorial; Write xml file using lxml library in Python; Changing the default indentation of etree.tostring in lxml