I have written a simple routine, parseXML, that parses simple XML files and extracts unique name-value pairs along with their attributes. The script picks up every XML tag of the form <abc arg1="hello">xyz</abc> and dynamically creates bash variables that hold the values of the elements as well as their attributes. This is a good solution if you don’t wish to use XPath for some simple XML files. You will, however, need xmllint installed on your system to use the script. Here’s a sample script which uses the parseXML function:
#!/bin/bash
xmlFile=$1

function parseXML() {
    # Collect the name of every closing tag; the root's closing tag comes last.
    elemList=( $(cat "$xmlFile" | tr '\n' ' ' | XMLLINT_INDENT="" xmllint --format - | /bin/grep -e "</.*>$" | while read line; do \
        echo $line | sed -e 's/^.*<\///' | cut -d '>' -f 1; \
    done) )

    # The root element name, followed by "_", starts every variable name.
    totalNoOfTags=${#elemList[@]}; ((totalNoOfTags--))
    suffix=$(echo ${elemList[$totalNoOfTags]} | tr -d '</>')
    suffix="${suffix}_"

    for (( i = 0 ; i < ${#elemList[@]} ; i++ )); do
        elem=${elemList[$i]}
        elemLine=$(cat "$xmlFile" | tr '\n' ' ' | XMLLINT_INDENT="" xmllint --format - | /bin/grep "</$elem>")

        # Skip lines that contain nothing but a closing tag.
        echo $elemLine | grep -e "^</[^ ]*>$" 1>/dev/null 2>&1
        if [ "0" = "$?" ]; then
            continue
        fi

        # Extract the trimmed element text and store it in <root>_<element>.
        elemVal=$(echo $elemLine | tr '\011' '\040'| sed -e 's/^[ ]*//' -e 's/^<.*>\([^<].*\)<.*>$/\1/' | sed -e 's/^[ ]*//' | sed -e 's/[ ]*$//')
        xmlElem="${suffix}$(echo $elem | sed 's/-/_/g')"
        eval ${xmlElem}=`echo -ne \""${elemVal}"\"`

        # Split the opening tag into alternating attribute names and values.
        # Spaces inside values are temporarily replaced by '>' so each value stays one word.
        attrList=($(cat "$xmlFile" | tr '\n' ' ' | XMLLINT_INDENT="" xmllint --format - | /bin/grep "</$elem>" | tr '\011' '\040' | sed -e 's/^[ ]*//' | cut -d '>' -f 1 | sed -e 's/^<[^ ]*//' | tr "'" '"' | tr '"' '\n' | tr '=' '\n' | sed -e 's/^[ ]*//' | sed '/^$/d' | tr '\011' '\040' | tr ' ' '>'))
        for (( j = 0 ; j < ${#attrList[@]} ; j++ )); do
            attr=${attrList[$j]}
            ((j++))
            attrVal=$(echo ${attrList[$j]} | tr '>' ' ')
            # Store the attribute value in <root>_<element>_<attribute>.
            attrName=`echo -ne ${xmlElem}_${attr}`
            eval ${attrName}=`echo -ne \""${attrVal}"\"`
        done
    done
}

parseXML
echo "$status_xyz | $status_abc | $status_pqr" # Variables for each XML element
echo "$status_xyz_arg1 | $status_abc_arg2 | $status_pqr_arg3 | $status_pqr_arg4" # Variables for each XML attribute
echo ""
# All the variables that were produced by the parseXML function
set | /bin/grep -e "^$suffix"
The XML file used for the script example above is:
<?xml version="1.0"?>
<status>
<xyz arg1="1"> a </xyz>
<abc arg2="2"> p </abc>
<pqr arg3="3" arg4="a phrase"> x </pqr>
</status>
The root tag, which in this case is “status”, is used as a prefix for all variables (the script stores it in a variable named suffix). Once the XML file is passed to the function, it dynamically creates the variables $status_xyz, $status_abc, $status_pqr, $status_xyz_arg1, $status_abc_arg2, $status_pqr_arg3 and $status_pqr_arg4.
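To illustrate the dynamic variable creation in isolation, here is a minimal sketch that uses the bash builtin printf -v rather than the eval used in the script. The helper name set_dynamic_var and the sample values are assumptions made for this example only; they are not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper (not in the original script): assign a value to a
# variable whose name is only known at run time, without using eval.
set_dynamic_var() {
    local name=$1 value=$2
    # Replace characters that are not valid in a bash identifier, such as '-'.
    name=${name//[^a-zA-Z0-9_]/_}
    # printf -v stores the formatted string in the variable named by $name.
    printf -v "$name" '%s' "$value"
}

prefix="status_"
set_dynamic_var "${prefix}xyz" "a"
set_dynamic_var "${prefix}pqr_arg4" "a phrase"
set_dynamic_var "${prefix}my-elem" 'value with $dollar and "quotes"'

echo "$status_xyz | $status_pqr_arg4 | $status_my_elem"
```

Because the value is passed to printf as data, quotes and dollar signs in the XML content are stored literally instead of being re-interpreted by the shell.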
The output when the script is run with the XML file as an argument is:
$ bash parseXML.sh test.xml
a | p | x
1 | 2 | 3 | a phrase

status_abc=p
status_abc_arg2=2
status_pqr=x
status_pqr_arg3=3
status_pqr_arg4='a phrase'
status_xyz=a
status_xyz_arg1=1
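The last line of the script lists the generated variables by filtering the output of set. As a possible alternative, the sketch below uses the bash builtin compgen together with indirect expansion. The two sample variables are assigned by hand here and only stand in for what parseXML would create.

```bash
#!/bin/bash
# Sample variables standing in for what parseXML would create.
status_xyz="a"
status_pqr_arg4="a phrase"

prefix="status_"
listing=""
# compgen -v PREFIX prints the names of all variables that start with PREFIX.
for name in $(compgen -v "$prefix"); do
    # ${!name} is indirect expansion: the value of the variable named by $name.
    listing+="${name}=${!name}"$'\n'
done
printf '%s' "$listing"
```

This gives the script direct access to each name and value, which is convenient when the variables need further processing rather than just printing.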
This script won’t work for XML files like the one below with duplicate element names, since all elements with the same name map to a single variable name.
<?xml version="1.0"?>
<status>
<test arg1="1"> a </test>
<test arg2="2"> p </test>
<test arg3="3" arg4="a phrase"> x </test>
</status>
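One possible way around this limitation is to number repeated element names. The sketch below is an assumption-laden illustration, not part of the original script: it works on lines that already hold one element each (as xmllint --format produces for this file), it needs bash 4 or later for associative arrays, and the naming scheme status_test_0, status_test_1, and so on is made up for this example.

```bash
#!/bin/bash
# Requires bash 4+ (associative arrays).
declare -A seen   # how many times each element name has appeared so far

prefix="status_"
# Matches <name attrs>text</name> on a single line.
re='^<([A-Za-z_][A-Za-z0-9_-]*)[^>]*>([^<]*)</[^>]+>$'

while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
        name=${BASH_REMATCH[1]//-/_}
        value=${BASH_REMATCH[2]}
        # Trim leading and trailing whitespace from the element text.
        value="${value#"${value%%[![:space:]]*}"}"
        value="${value%"${value##*[![:space:]]}"}"
        # Append a running index so repeated names get distinct variables.
        n=${seen[$name]:-0}
        seen[$name]=$(( n + 1 ))
        printf -v "${prefix}${name}_${n}" '%s' "$value"
    fi
done <<'EOF'
<?xml version="1.0"?>
<status>
<test arg1="1"> a </test>
<test arg2="2"> p </test>
<test arg3="3" arg4="a phrase"> x </test>
</status>
EOF

echo "${seen[test]} test elements: $status_test_0 | $status_test_1 | $status_test_2"
```

The seen array doubles as a per-name count, so a caller can loop from 0 to that count minus one to visit every duplicate.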
This script also can’t extract the attributes of elements that have no text content (character data) of their own. For example, given the file below, the script won’t create variables corresponding to <test arg1="1">. It will only create the variables corresponding to <test1 arg2="2">abc</test1>.
<?xml version="1.0"?>
<status>
<test arg1="1">
<test1 arg2="2">abc</test1>
</test>
</status>
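Attributes can be read from an opening tag on its own, whether or not the element has text. The sketch below shows one way to do that with bash regular expressions. The helper name parse_open_tag is hypothetical, and the sketch assumes double-quoted attribute values and one opening tag per call.

```bash
#!/bin/bash
# Hypothetical helper: create PREFIX_ELEMENT_ATTRIBUTE variables from one opening tag.
# Assumes attribute values are enclosed in double quotes.
parse_open_tag() {
    local prefix=$1 tag=$2
    local name_re='^<([A-Za-z_][A-Za-z0-9_-]*)'
    local attr_re='([A-Za-z_][A-Za-z0-9_-]*)="([^"]*)"'

    # Extract the element name; give up if the string is not an opening tag.
    [[ $tag =~ $name_re ]] || return 1
    local elem=${BASH_REMATCH[1]//-/_}
    local rest=${tag#"${BASH_REMATCH[0]}"}

    # Consume one name="value" pair per iteration.
    while [[ $rest =~ $attr_re ]]; do
        printf -v "${prefix}${elem}_${BASH_REMATCH[1]//-/_}" '%s' "${BASH_REMATCH[2]}"
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
}

parse_open_tag "status_" '<test arg1="1">'
parse_open_tag "status_" '<test1 arg2="2">'
echo "$status_test_arg1 | $status_test1_arg2"
```

Feeding every line that starts with an opening tag through such a helper would cover parent elements like <test arg1="1"> that the original pipeline skips.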
Hi Partik,
This script is wonderful! Using it saved me a lot of time. But I have an issue here.
Is it possible to loop through the records if there are multiple sets of records?
a
p
x
123
456
789
Such that when I run, I get the following:
parseXML.sh test.xml
a | p | x
123 | 456 | 789
Cheers,
Jfk
Sorry, not sure how to enter code in here. Below is the XML with the angle brackets replaced by [].
[?xml version="1.0"?]
[status]
[xyz arg1="1"] a [/xyz]
[abc arg2="2"] p [/abc]
[pqr arg3="3" arg4="a phrase"] x [/pqr]
[/status]
[status]
[xyz arg1="1"] 123 [/xyz]
[abc arg2="2"] 456 [/abc]
[pqr arg3="3" arg4="a phrase"] 789 [/pqr]
[/status]
I want to parse an XML file like the one below and store the location and filename in variables. How can I do it?
ABC.sql
C:\Data\DDL
This is Data Model File
XYZ.sql
C:\DDL
This DDL file is to drop all the tables of Database
ABC.zip
C:\Reports
The Reports zip file to be imported.
EFG.zip
C:\Model
The Framework model files
This script is really helpful. But when I tried to run it, the script didn’t work at first because the value of the XMLLINT_INDENT variable is an “empty string” instead of a “single space”. Maybe the HTML formatting causes the single space to disappear.
Hi Kenneth, XMLLINT_INDENT is supposed to be empty, because I don’t want the xml to be indented. I checked the script again and it works for me with XMLLINT_INDENT as an empty string. I remember some corner cases, where the indentation was causing an issue.
Hi, I was wondering if someone can provide a script to reverse this procedure (to generate the same XML format as above from the same text format)?
I think you meant the root tag is used as a *prefix. But otherwise it’s a well-written article; it really helped me make a configurable backup script.
Oh yes, it should be prefix!