ZetCode

Python XML with SAX

last modified February 15, 2025

In this article, we show how to use the SAX (Simple API for XML) in Python for event-driven XML parsing. SAX is a memory-efficient approach to parsing XML documents, making it suitable for large files. Unlike DOM (Document Object Model), SAX does not load the entire XML document into memory. Instead, it processes the document sequentially and triggers events as it encounters elements, attributes, and text.

The xml.sax module is part of Python's standard library, so no additional installation is required.

Basic SAX Parsing

The following example demonstrates how to parse an XML document using SAX. We create a custom handler class to handle events such as start elements, end elements, and character data.

main.py
import xml.sax
from io import StringIO

class MyHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.current_element = ""
        self.current_data = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.current_element = tag
        if tag == "book":
            print("Book Id:", attributes["id"])

    # Called when an element ends
    def endElement(self, tag):
        if tag == "title":
            print("Title:", self.current_data)
        elif tag == "author":
            print("Author:", self.current_data)
        elif tag == "year":
            print("Year:", self.current_data)
        self.current_data = ""

    # Called when character data is found
    def characters(self, content):
        if self.current_element in ["title", "author", "year"]:
            self.current_data += content.strip()

# XML data
xml_data = """
<catalog>
    <book id="1">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
    <book id="2">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
</catalog>
"""

# Create a SAX parser
parser = xml.sax.make_parser()
handler = MyHandler()
parser.setContentHandler(handler)

# Parse the XML data
parser.parse(StringIO(xml_data))

In this program, the MyHandler class inherits from xml.sax.ContentHandler and overrides the startElement, endElement, and characters methods to handle XML events.

parser.parse(StringIO(xml_data))

The StringIO is used to create an in-memory file-like object from the xml_data string. This allows the parser.parse method to read the XML data as if it were reading from a file.

$ python main.py
Book Id: 1
Title: The Great Gatsby
Author: F. Scott Fitzgerald
Year: 1925
Book Id: 2
Title: 1984
Author: George Orwell
Year: 1949

Handling Attributes

The following example demonstrates how to handle attributes in XML elements using SAX.

main.py
import xml.sax
from io import StringIO

import xml.sax


class MyHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.current_element = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.current_element = tag
        if tag == "book":
            print("Book Id:", attributes["id"])
            print("Category:", attributes["category"])

    # Called when an element ends
    def endElement(self, tag):
        pass

    # Called when character data is found
    def characters(self, content):
        pass


# XML data
xml_data = """
<catalog>
    <book id="1" category="fiction">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
    <book id="2" category="dystopian">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
    <book id="3" category="fiction">
        <title>War and Peace</title>
        <author>Leo Tolstoy</author>
        <year>1869</year>
    </book>
</catalog>
"""

# Create a SAX parser
parser = xml.sax.make_parser()
handler = MyHandler()
parser.setContentHandler(handler)

# Parse the XML data
parser.parse(StringIO(xml_data))

In this program, the startElement method is used to handle attributes of the book element, such as id and category.

$ python main.py
Book Id: 1
Category: fiction
Book Id: 2
Category: dystopian
Book Id: 3
Category: fiction

Parsing XML Files

The following example demonstrates how to parse an XML file using SAX. This approach is memory-efficient because it processes the file sequentially without loading it entirely into memory.

products.xml
<products>
    <product>
        <id>1</id>
        <name>Product 1</name>
        <price>10.99</price>
        <quantity>30</quantity>
    </product>
    <product>
        <id>2</id>
        <name>Product 2</name>
        <price>20.99</price>
        <quantity>130</quantity>
    </product>
    <product>
        <id>4</id>
        <name>Product 4</name>
        <price>24.59</price>
        <quantity>350</quantity>
    </product>
    <product>
        <id>5</id>
        <name>Product 5</name>
        <price>9.9</price>
        <quantity>650</quantity>
    </product>
    <product>
        <id>6</id>
        <name>Product 6</name>
        <price>45</price>
        <quantity>290</quantity>
    </product>
</products>

This is the file.

main.py
from xml.sax import make_parser, ContentHandler

class ProductHandler(ContentHandler):
  def __init__(self):
    self.current_data = ""
    self.product = {}
  
  def startElement(self, name, attrs):
    self.current_data = ""
    if name == "product":
      self.product = {}
  
  def characters(self, content):
    self.current_data += content.strip()
  
  def endElement(self, name):
    if name != "product":
      self.product[name] = self.current_data
    else:
      print(f"Id: {self.product['id']}, Name: {self.product['name']}")

parser = make_parser()
parser.setContentHandler(ProductHandler())
parser.parse("products.xml")

In this program, the parser.parse method is used to parse a XML file named products.xml. The SAX parser processes the file sequentially, making it suitable for large files.

$ python main.py
Id: 1, Name: Product 1
Id: 2, Name: Product 2
Id: 4, Name: Product 4
Id: 5, Name: Product 5
Id: 6, Name: Product 6

Source

Python SAX - Documentation

In this article, we have shown how to use the SAX API in Python for event-driven XML parsing. The SAX approach is memory-efficient and suitable for large XML files.

Author

My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have authored over 1,400 articles and 8 e-books. I possess more than ten years of experience in teaching programming.

List all Python tutorials.