Python XML Parsing: A Comprehensive Guide
XML (Extensible Markup Language) is a widely used markup language for storing and exchanging data. Python provides several libraries to parse it. This guide covers two modules from the standard library, xml.dom.minidom and xml.etree.ElementTree, then shows how to save parsed data to CSV with pandas and how to handle malformed XML with Beautiful Soup.
Step 1: XML Parsing Using xml.dom.minidom
The xml.dom.minidom module is a minimal implementation of the Document Object Model (DOM) interface. To parse an XML string, use its parseString function:
from xml.dom.minidom import parseString
xml_string = """
<library>
  <book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
  </book>
</library>
"""
document = parseString(xml_string)
print(document.getElementsByTagName("title")[0].firstChild.nodeValue)
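In the code above, we parse the XML string into a DOM document. getElementsByTagName returns a list of every matching element, so we take the first title element and read its text from the nodeValue of its first child node.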
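Real documents usually contain more than one record, and indexing with [0] fails when a tag is missing. The following is a minimal sketch of how the same minidom calls could be wrapped to collect every book. The child_text helper and the second book entry are illustrative additions, not part of the original example.

```python
from xml.dom.minidom import parseString

# Sample data; the second <book> is added here purely for illustration.
xml_string = """
<library>
  <book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
  </book>
  <book>
    <title>Nineteen Eighty-Four</title>
    <author>George Orwell</author>
    <year>1949</year>
  </book>
</library>
"""


def child_text(parent, tag):
    """Hypothetical helper: text of the first <tag> under parent, or None."""
    nodes = parent.getElementsByTagName(tag)
    # Guard against a missing tag or an empty element such as <title/>.
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.nodeValue


document = parseString(xml_string)

# Build one dictionary per <book> element.
books = [
    {field: child_text(book, field) for field in ("title", "author", "year")}
    for book in document.getElementsByTagName("book")
]

for book in books:
    print(book)
```

Note that minidom returns every value as a string, so year would need an explicit int() conversion before any numeric use.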
Step 2: XML Parsing Using xml.etree.ElementTree
The xml.etree.ElementTree (ET) module provides a more Pythonic way to parse XML. To parse an XML string, use the fromstring function, which returns the root element:
import xml.etree.ElementTree as ET
xml_string = """
<library>
  <book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
  </book>
</library>
"""
root = ET.fromstring(xml_string)
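for child in root.iter():
    # text is None for elements with no text content, so check it before calling strip()
    if child.text and child.text.strip():
        print(child.text)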
In the code above, we parse the XML string, walk every element in the tree with iter(), and print the text of each element that contains more than whitespace.
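Walking the whole tree prints values without saying which field they belong to. When the structure is known, ElementTree's findall and findtext can target specific elements by name. The sketch below assumes the same library/book layout as above; the records list is just an illustrative way to hold the results.

```python
import xml.etree.ElementTree as ET

xml_string = """
<library>
  <book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
  </book>
</library>
"""

root = ET.fromstring(xml_string)

records = []
# findall("book") returns the direct <book> children of the root element.
for book in root.findall("book"):
    # findtext returns the text of the first matching child, or None if absent.
    title = book.findtext("title")
    author = book.findtext("author")
    year_text = book.findtext("year")
    # Convert the year only when the element is present and non-empty.
    year = int(year_text) if year_text else None
    records.append((title, author, year))

for title, author, year in records:
    print(f"{title} by {author} ({year})")
```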
Step 3: Parsing XML Files
Both minidom and ElementTree can parse XML from files using their parse functions. The examples below assume a file named sample.xml exists in the working directory:
# Using minidom
from xml.dom.minidom import parse
document = parse("sample.xml")
print(document.getElementsByTagName("title")[0].firstChild.nodeValue)
# Using ElementTree
import xml.etree.ElementTree as ET
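# ET.parse returns an ElementTree object; getroot() gives the root element
tree = ET.parse("sample.xml")
root = tree.getroot()
for child in root.iter():
    # text is None for elements with no text content, so check it before calling strip()
    if child.text and child.text.strip():
        print(child.text)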
The code above demonstrates how to parse an XML file and print some elements.
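To try file parsing without preparing sample.xml by hand, the sketch below first writes the sample document to a temporary directory and then reads it back with both modules. The temporary path and the SAMPLE constant are assumptions made for this illustration only.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from xml.dom.minidom import parse

# Illustrative file content; an XML declaration must be the very first thing in the file.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
  </book>
</library>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.xml"
    path.write_text(SAMPLE, encoding="utf-8")

    # minidom: parse() accepts a filename and returns a Document.
    document = parse(str(path))
    minidom_title = document.getElementsByTagName("title")[0].firstChild.nodeValue

    # ElementTree: parse() returns an ElementTree; getroot() returns the root Element.
    tree = ET.parse(str(path))
    root = tree.getroot()
    et_title = root.findtext("book/title")

print(minidom_title)
print(et_title)
```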
Step 4: Saving XML Data to a CSV File
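After parsing the XML, you can save the data to a CSV file using the third-party pandas library (installed with pip install pandas). The dictionary below stands in for values extracted from the XML, with one list per column: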
import pandas as pd
parsed_dict = {
    "title": ["The Great Gatsby"],
    "author": ["F. Scott Fitzgerald"],
    "year": [1925]
}
df = pd.DataFrame(parsed_dict)
df.to_csv("parsed_xml_data.csv", index=False)
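The dictionary above is written by hand rather than built from a parsed document. As a rough sketch of the missing link, the rows can be collected with ElementTree and then written out. To keep the example free of third-party dependencies, it swaps in the standard library's csv module and writes to an in-memory buffer instead of a file; the fields and rows names are illustrative.

```python
import csv
import io
import xml.etree.ElementTree as ET

xml_string = """
<library>
  <book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
  </book>
</library>
"""

root = ET.fromstring(xml_string)
fields = ["title", "author", "year"]

# One dictionary per <book>, with an empty string for any missing field.
rows = [
    {field: book.findtext(field, default="") for field in fields}
    for book in root.findall("book")
]

# Write to an in-memory buffer; open a real file with newline="" to save to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

The same rows list can also be handed to pandas, for example pd.DataFrame(rows).to_csv("parsed_xml_data.csv", index=False), if pandas is already part of the project.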
Step 5: Handling Invalid XML
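The third-party Beautiful Soup library can parse XML documents that contain errors, such as the unescaped & in the example below. Its lxml-xml parser also requires the lxml package (pip install beautifulsoup4 lxml):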
from bs4 import BeautifulSoup
invalid_xml = """
<root>
  <person>
    <name>John Doe</name>
    <message>This is a message & an invalid XML example.</message>
  </person>
</root>
"""
soup = BeautifulSoup(invalid_xml, features="lxml-xml")
print(soup.prettify())
Beautiful Soup can recover from malformed XML such as the unescaped & above, but it is generally slower than the other parsers in this guide.
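For comparison, a strict parser rejects the same kind of input outright. The sketch below uses a shortened version of the message from the example to show ElementTree raising ParseError on a bare &, and then parsing successfully once the character is escaped as &amp;. The string replacement is a simplification that only suits this particular input, not a general repair strategy.

```python
import xml.etree.ElementTree as ET

# A bare "&" is not well-formed XML; it must be written as "&amp;".
invalid_xml = "<root><message>This is a message & an invalid XML example.</message></root>"

try:
    ET.fromstring(invalid_xml)
    strict_result = "parsed"
except ET.ParseError as error:
    # ElementTree refuses the document instead of trying to recover.
    strict_result = f"rejected: {error}"

# Naive fix for this specific input only: escape the lone ampersand.
valid_xml = invalid_xml.replace(" & ", " &amp; ")
message = ET.fromstring(valid_xml).findtext("message")

print(strict_result)
print(message)
```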