Handling XML with Python (3) | Never Too Late

Last time we used the DOM only to look at the tree structure of an XML document. This time we specify tag names and the like to build simple sentences from XML.

—–

sample.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a sample xml document -->
<world>
    <country id="1">
        <name>Japan</name>
        <capital>Tokyo</capital>
    </country>
    <country id="2">
        <name>Korea</name>
        <capital>Seoul</capital>
    </country>
    <country id="3">
        <name>United States</name>
        <capital>Washington D.C</capital>
    </country>
</world>

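The country tags carry a notation like id="1"; this is called an attribute. Like tags, attributes can be given any name and can appear in any number, so you are free to define them as you like.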
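As a small illustration of how freely attributes can be defined, the sketch below parses a one-line element and lists every attribute it carries. The region attribute and the inline XML string are made up for this example and are not part of sample.xml.

```python
from xml.dom import minidom

# Hypothetical element: "region" is an invented attribute, added only to
# show that an element can carry several attributes with arbitrary names.
doc = minidom.parseString(
    '<country id="1" region="Asia"><name>Japan</name></country>'
)
country = doc.documentElement

# attributes is a NamedNodeMap; items() yields (name, value) pairs.
attrs = dict(country.attributes.items())
print(attrs)

# hasAttribute() reports whether a given attribute is present.
print(country.hasAttribute("id"), country.hasAttribute("population"))
```

Attribute values always come back as strings, so a numeric id would still need an explicit int() conversion.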
The following program reads sample.xml.

from xml.dom import minidom, Node


def scanCountry(node):
    # Print the id attribute, then the text of the name and capital children.
    print("CountryID%s" % node.getAttribute("id"), end=" ")
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            if child.tagName == 'name':
                print('is %s.' % getText(child), end=" ")
            if child.tagName == 'capital':
                print('Its capital city is %s.' % getText(child))


def getText(node):
    # Concatenate the text nodes directly under the given element.
    s = ''
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            s += child.data
    return s


if __name__ == '__main__':
    doc = minidom.parse('sample.xml')
    # Handle every country element found anywhere in the document.
    for node in doc.getElementsByTagName('country'):
        scanCountry(node)

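getElementsByTagName(tagName) searches all descendant elements and returns a list of the elements whose tag name is tagName. In this program it retrieves every country element in sample.xml.
getAttribute(attName) returns the value of the attribute named attName as a string. Here it is used to get the value of the id attribute.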
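To make the behavior of these two calls concrete, here is a minimal sketch on an inline document. The region wrapper element and the country without an id are invented for this example; they show that getElementsByTagName reaches nested elements and that minidom's getAttribute returns an empty string when the attribute is absent.

```python
from xml.dom import minidom

# Hypothetical test document: <region> and the id-less country are
# assumptions made for illustration, not part of sample.xml.
xml_text = """<world>
  <country id="1"><name>Japan</name></country>
  <region><country id="2"><name>Korea</name></country></region>
  <country><name>Nowhere</name></country>
</world>"""
doc = minidom.parseString(xml_text)

# Matches every <country> descendant, at any depth, in document order.
countries = doc.getElementsByTagName("country")

# getAttribute always returns a string; a missing attribute gives "".
ids = [c.getAttribute("id") for c in countries]

print(len(countries))
print(ids)
```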
Running the program produced the following output.

CountryID1 is Japan. Its capital city is Tokyo.
CountryID2 is Korea. Its capital city is Seoul.
CountryID3 is United States. Its capital city is Washington D.C.

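By retrieving attribute values and the values of text nodes, we were able to build sentences from an XML document. As mentioned last time, the online manual explains in detail the Node types that the DOM uses to represent an XML document.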
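The same sentences can also be collected into a list instead of being printed piece by piece, which is easier to reuse or test. The following is only one possible sketch: text_of and describe are hypothetical helper names, and the inline SAMPLE string is a shortened stand-in for sample.xml.

```python
from xml.dom import minidom

# Shortened stand-in for sample.xml (an assumption for this sketch).
SAMPLE = """<world>
<country id="1"><name>Japan</name><capital>Tokyo</capital></country>
<country id="2"><name>Korea</name><capital>Seoul</capital></country>
</world>"""


def text_of(element, tag):
    # Hypothetical helper: text of the first <tag> descendant, or "" if none.
    found = element.getElementsByTagName(tag)
    if not found:
        return ""
    return "".join(
        n.data for n in found[0].childNodes if n.nodeType == n.TEXT_NODE
    )


def describe(xml_text):
    # Build one sentence per <country> from its id attribute and child text.
    doc = minidom.parseString(xml_text)
    return [
        "CountryID%s is %s. Its capital city is %s."
        % (c.getAttribute("id"), text_of(c, "name"), text_of(c, "capital"))
        for c in doc.getElementsByTagName("country")
    ]


sentences = describe(SAMPLE)
for line in sentences:
    print(line)
```

Looking up name and capital by tag name, rather than walking childNodes in order, means the sentence does not depend on the order in which the two child elements appear.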
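Next time we will create an XML document using the DOM. An XML document can of course be produced with ordinary string-manipulation code, but the DOM interface seems to make it easier to produce one that is syntactically correct.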