Question: Add index to child elements within an XML file in Python

question created at Sat, Jun 1, 2019 12:00 AM

I'm a novice in Python, so please help. I'd like to add an index attribute to the row and column elements, in the same way that the page elements already have one.

Within page 1 there are 4 rows, so the row index would go from 0 to 3. Within page 1, row 0 there is only one column, so its index would be just 0. In page 1, row 2 there are 3 columns, so the column index would go from 0 to 2. The same applies to the other rows in the other pages.

I have started testing with ElementTree, but only the basics of printing elements. Maybe someone could help me with this.

I have the following code for basic testing only, but I don't know how to proceed from here.

import xml.etree.ElementTree as ET
tree = ET.parse('smp.xml')
root = tree.getroot()

for text in root.iter('text'):
    print(text.attrib)

for text in root.iter('text'):
    print(text.text)

The input XML is like this:

<?xml version="1.0"?>
<doc>
    <page index="0"/>
    <page index="1">
        <row>
            <column>
                <text>fibrous drupe</text>
            </column>
        </row>
        <row>
            <column>
                <text>follicle</text>
            </column>
            <column>
                <text>legume</text>
            </column>
        </row>
        <row>
            <column>
                <text>loment</text>
            </column>
            <column>
                <text>nut</text>
            </column>
            <column>
                <text>samara</text>
            </column>
        </row>
        <row>
            <column>
                <text>schizocarp</text>
            </column>
        </row>
    </page>
    <page index="2">
        <row>
            <column>
                <text>cypsela</text>
            </column>
        </row>
    </page>
    <page index="3"/>
</doc>

and I would like to convert it to this:

<?xml version="1.0"?>
<doc>
    <page index="0"/>
    <page index="1">
        <row index="0">
            <column index="0">
                <text>fibrous drupe</text>
            </column>
        </row>
        <row index="1">
            <column index="0">
                <text>follicle</text>
            </column>
            <column index="1">
                <text>legume</text>
            </column>
        </row>
        <row index="2">
            <column index="0">
                <text>loment</text>
            </column>
            <column index="1">
                <text>nut</text>
            </column>
            <column index="2">
                <text>samara</text>
            </column>
        </row>
        <row index="3">
            <column index="0">
                <text>schizocarp</text>
            </column>
        </row>
    </page>
    <page index="2">
        <row index="0">
            <column index="0">
                <text>cypsela</text>
            </column>
        </row>
    </page>
    <page index="3"/>
</doc>

I hope this makes sense. Thanks in advance.

2 Answers

See the code below ('56403870.xml' is the XML you posted):

import xml.etree.ElementTree as ET

tree = ET.parse('56403870.xml')
root = tree.getroot()

pages = root.findall('.//page')
for page in pages:
    # Row numbering restarts at 0 for every page
    rows = page.findall('.//row')
    for r, row in enumerate(rows):
        row.attrib['index'] = str(r)
        # Column numbering restarts at 0 for every row
        columns = row.findall('.//column')
        for c, col in enumerate(columns):
            col.attrib['index'] = str(c)

ET.dump(tree)

output

<doc>
    <page index="0" />
    <page index="1">
        <row index="0">
            <column index="0">
                <text>fibrous drupe</text>
            </column>
        </row>
        <row index="1">
            <column index="0">
                <text>follicle</text>
            </column>
            <column index="1">
                <text>legume</text>
            </column>
        </row>
        <row index="2">
            <column index="0">
                <text>loment</text>
            </column>
            <column index="1">
                <text>nut</text>
            </column>
            <column index="2">
                <text>samara</text>
            </column>
        </row>
        <row index="3">
            <column index="0">
                <text>schizocarp</text>
            </column>
        </row>
    </page>
    <page index="2">
        <row index="0">
            <column index="0">
                <text>cypsela</text>
            </column>
        </row>
    </page>
    <page index="3" />
</doc>
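
A possible variation on the code above, sketched under the assumption that rows are always direct children of a page and columns are always direct children of a row (as in the posted XML). `findall('row')` matches direct children only, whereas `.//row` matches rows at any depth, which would matter if tables were ever nested. The `add_indexes` helper and the shortened inline sample are made up for illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened inline sample so the sketch runs without a file on disk.
SAMPLE = """<doc>
    <page index="0"/>
    <page index="1">
        <row><column><text>follicle</text></column><column><text>legume</text></column></row>
        <row><column><text>schizocarp</text></column></row>
    </page>
</doc>"""


def add_indexes(root):
    """Number each page's rows, and each row's columns, starting at 0."""
    for page in root.findall('page'):            # direct children of <doc>
        for r, row in enumerate(page.findall('row')):
            row.set('index', str(r))
            for c, col in enumerate(row.findall('column')):
                col.set('index', str(c))
    return root


root = add_indexes(ET.fromstring(SAMPLE))
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

To save the result instead of printing it, `ET.ElementTree(root).write('output.xml', encoding='utf-8', xml_declaration=True)` should also emit an XML declaration line at the top of the file.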
2019-06-02 13:52:47Z
  1. Thank you balderman. It works nicely. I see 3 loops are needed!
    2019-06-02 18:25:11Z

I'm new to Python myself, so you'll have to finish this yourself:

import xml.etree.ElementTree as ET
tree = ET.parse('smp.xml')
root = tree.getroot()

for page in root:
    print(page.tag, page.attrib)
    for child in page:
        print(" ", child.tag, child.attrib)
        if child.tag == 'row':
            # Placeholder value; replace '42' with the real row index
            child.set('index', '42')

tree.write('output.xml')

In 'output.xml' you will get:

<doc>
    <page index="0" />
    <page index="1">
        <row index="42">
            <column>
                <text>fibrous drupe</text>
            </column>
        </row>
        <row index="42">
            <column>
                <text>follicle</text>
            </column>
            <column>
                <text>legume</text>
            </column>
        </row>
        <row index="42">
            <column>
               ….

So, you need to change those '42' to the value you need them to be. 😊
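
One way the '42' placeholder might be replaced, as a minimal sketch rather than the only approach: keep a counter per page for rows and a counter per row for columns, and reset each counter whenever a new parent starts. The inline XML string and the names `row_index` and `col_index` are assumptions made up for this illustration; with the real file you would parse 'smp.xml' as above.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in document (hypothetical), parsed from a string instead of 'smp.xml'.
root = ET.fromstring(
    '<doc><page index="1">'
    '<row><column/><column/></row>'
    '<row><column/></row>'
    '</page></doc>'
)

for page in root:                          # children of <doc>
    row_index = 0                          # restarts for every page
    for child in page:                     # children of <page>
        if child.tag == 'row':
            child.set('index', str(row_index))
            row_index += 1
            col_index = 0                  # restarts for every row
            for grandchild in child:       # children of <row>
                if grandchild.tag == 'column':
                    grandchild.set('index', str(col_index))
                    col_index += 1

result = ET.tostring(root, encoding='unicode')
print(result)
```

Attribute values have to be strings, hence the `str(...)` calls; passing an integer to `set` would fail when the tree is serialized.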

2019-06-01 15:50:48Z
  1. Many thanks for your help and your approach in how to accomplish this Luuk. Best regards
    2019-06-01 16:51:49Z
  2. @Parfait Hi. The solution doesn't work for me :(
    2019-06-01 20:54:58Z
  3. @Parfait It is not an error; the output is simply not as expected.
    2019-06-02 17:25:20Z
  4. @Ger: Do you need help with this? or.... (I am a bit confused about what to do when reading above comments 😊)
    2019-06-02 17:47:48Z
  5. @Luuk Thanks for your help Luuk. Since we are both new to Python, we can both benefit from balderman's solution, which works as expected. Thanks anyway
    2019-06-02 18:26:36Z