Parse Simple XML Files using Bash – Extract Name Value Pairs and Attributes

I have written a simple routine, parseXML, that parses simple XML files and extracts unique name-value pairs and their attributes. The script extracts all XML tags of the form <abc arg1="hello">xyz</abc> and dynamically creates bash variables that hold the values of the attributes as well as the elements. This is a good solution if you don’t wish to use XPath for some simple XML files. However, you will need xmllint installed on your system to use the script. Here’s a sample script which uses the parseXML function:

#!/bin/bash
# Usage: bash parseXML.sh file.xml   (requires xmllint)
xmlFile=$1

function parseXML() {
  # Join the file into one line and let xmllint re-emit it without
  # indentation. Run xmllint once and reuse the result.
  local formatted
  formatted=$(tr '\n' ' ' < "$xmlFile" | XMLLINT_INDENT="" xmllint --format -)

  # Tag name of every line that ends with a closing tag; the root tag is last.
  elemList=( $(printf '%s\n' "$formatted" | /bin/grep -e "</.*>$" | while read -r line; do
    echo "$line" | sed -e 's/^.*<\///' | cut -d '>' -f 1
  done) )

  # The root tag name plus "_" is prepended to every variable name
  # (the variable is named "suffix", but it is used as a prefix).
  totalNoOfTags=${#elemList[@]}; ((totalNoOfTags--))
  suffix=$(echo "${elemList[$totalNoOfTags]}" | tr -d '</>')
  suffix="${suffix}_"

  for (( i = 0 ; i < ${#elemList[@]} ; i++ )); do
    elem=${elemList[$i]}
    elemLine=$(printf '%s\n' "$formatted" | /bin/grep "</$elem>")

    # Skip lines that contain only a closing tag (the root or a parent element).
    if echo "$elemLine" | grep -q -e "^</[^ ]*>$"; then
      continue
    fi

    # Text between the opening and closing tag, trimmed.
    elemVal=$(echo "$elemLine" | tr '\011' '\040' | sed -e 's/^[ ]*//' -e 's/^<.*>\([^<].*\)<.*>$/\1/' -e 's/^[ ]*//' -e 's/[ ]*$//')

    # "-" is not allowed in variable names, so map it to "_".
    xmlElem="${suffix}$(echo "$elem" | sed 's/-/_/g')"
    # printf -v assigns without eval, so the value is never run as code.
    printf -v "$xmlElem" '%s' "$elemVal"

    # Flatten the opening tag's attributes into: name value name value ...
    # Spaces inside values become ">" so word splitting keeps them together.
    attrList=( $(echo "$elemLine" | tr '\011' '\040' | sed -e 's/^[ ]*//' | cut -d '>' -f 1 | sed -e 's/^<[^ ]*//' | tr "'" '"' | tr '"' '\n' | tr '=' '\n' | sed -e 's/^[ ]*//' | sed '/^$/d' | tr '\011' '\040' | tr ' ' '>') )
    for (( j = 0 ; j < ${#attrList[@]} ; j += 2 )); do
      attr=${attrList[$j]}
      attrVal=$(printf '%s\n' "${attrList[$((j + 1))]}" | tr '>' ' ')
      printf -v "${xmlElem}_${attr}" '%s' "$attrVal"
    done
  done
}

parseXML
echo "$status_xyz |  $status_abc |  $status_pqr" # Variables for each XML element
echo "$status_xyz_arg1 |  $status_abc_arg2 |  $status_pqr_arg3 | $status_pqr_arg4" # Variables for each XML attribute
echo ""

#All the variables that were produced by the parseXML function
set | /bin/grep -e "^$suffix"

The XML File used for the above script example is:

<?xml version="1.0"?>
<status>
  <xyz arg1="1"> a </xyz>
  <abc arg2="2"> p </abc>
  <pqr arg3="3" arg4="a phrase"> x </pqr>
</status>

The root tag, which in this case is “status”, is used as a prefix for all variables (the script stores it in a variable named suffix). Once the XML file is passed to the function, it dynamically creates the variables $status_xyz, $status_abc, $status_pqr, $status_xyz_arg1, $status_abc_arg2, $status_pqr_arg3 and $status_pqr_arg4.
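
The naming scheme can be tried in isolation. The following is a minimal sketch, not part of the script above: set_prefixed_var is a hypothetical helper name, and it assigns with printf -v and reads back with bash indirect expansion, which is handy when the variable name is only known at run time.

```bash
#!/bin/bash
# Hypothetical helper (not part of parseXML): build "<prefix>_<name>"
# and assign a value to the variable with that name.
set_prefixed_var() {
  local prefix=$1 name=$2 value=$3
  # Variable names cannot contain "-", so map it to "_".
  local varName="${prefix}_${name//-/_}"
  # printf -v stores the value in the named variable without using eval.
  printf -v "$varName" '%s' "$value"
}

set_prefixed_var status xyz "a"
set_prefixed_var status pqr_arg4 "a phrase"

echo "$status_xyz"        # direct access when the name is known in advance
varName="status_pqr_arg4"
echo "${!varName}"        # indirect access when the name is built at run time
```

Because the generated names are not declared local inside the function, they remain visible to the rest of the script after the call.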

The output when the script is run with the XML file as an argument is:

$ bash parseXML.sh test.xml
a |  p |  x
1 |  2 |  3 | a phrase

status_abc=p
status_abc_arg2=2
status_pqr=x
status_pqr_arg3=3
status_pqr_arg4='a phrase'
status_xyz=a
status_xyz_arg1=1

This script won’t work for XML files with duplicate element names, like the one below, because each variable name is derived from the element name:

<?xml version="1.0"?>
<status>
  <test arg1="1"> a </test>
  <test arg2="2"> p </test>
  <test arg3="3" arg4="a phrase"> x </test>
</status>
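
One possible way around this limitation is to number repeated elements. The sketch below is an assumption-laden illustration, not the script's behaviour: parse_repeated is a hypothetical function, it needs bash 4 or later for associative arrays, and it expects input that already has one element per line (as xmllint --format prints it). It sets variables named <prefix>_<tag>_<n>.

```bash
#!/bin/bash
# Hypothetical variant: number repeated elements instead of reusing one name.
# Reads lines of the form <tag ...>text</tag> on stdin.
parse_repeated() {
  local prefix=$1 line tag text
  local -A seen=()   # per-tag counter (associative array, bash 4+)
  local re='^[[:space:]]*<([A-Za-z_][A-Za-z0-9_-]*)[^>]*>([^<]*)</([A-Za-z_][A-Za-z0-9_-]*)>[[:space:]]*$'
  while IFS= read -r line; do
    [[ $line =~ $re ]] || continue
    # Opening and closing tag names must agree.
    [[ ${BASH_REMATCH[1]} == "${BASH_REMATCH[3]}" ]] || continue
    tag=${BASH_REMATCH[1]//-/_}
    text=${BASH_REMATCH[2]}
    # Trim leading and trailing whitespace from the element text.
    text="${text#"${text%%[![:space:]]*}"}"
    text="${text%"${text##*[![:space:]]}"}"
    seen[$tag]=$(( ${seen[$tag]:-0} + 1 ))
    printf -v "${prefix}_${tag}_${seen[$tag]}" '%s' "$text"
  done
}

# A here-document keeps the function in the current shell; feeding it
# through a pipe would set the variables in a subshell and lose them.
parse_repeated status <<'EOF'
<test arg1="1"> a </test>
<test arg2="2"> p </test>
<test arg3="3" arg4="a phrase"> x </test>
EOF

echo "$status_test_1 | $status_test_2 | $status_test_3"
```

The same counter idea could be applied to attributes, at the cost of longer variable names.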

This script also won’t be able to extract attributes of elements that have no character data (text content) of their own. For example, in the file below the script won’t create variables corresponding to <test arg1="1">. It will only create the variables corresponding to <test1 arg2="2">abc</test1>.

<?xml version="1.0"?>
<status>
  <test arg1="1">
    <test1 arg2="2">abc</test1>
  </test>
</status>
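
Attributes of such elements could be picked up by looking at the opening tag alone. The following is only a sketch under stated assumptions: parse_open_tag_attrs is a hypothetical function, it is handed one line at a time, and it handles only double-quoted attribute values.

```bash
#!/bin/bash
# Hypothetical helper: set <prefix>_<tag>_<attr> for every attribute in one
# opening tag, whether or not text follows the tag on the same line.
parse_open_tag_attrs() {
  local prefix=$1 line=$2 tag rest
  local tagRe='^[[:space:]]*<([A-Za-z_][A-Za-z0-9_-]*)([^>]*)>'
  local attrRe='^[[:space:]]*([A-Za-z_][A-Za-z0-9_-]*)[[:space:]]*=[[:space:]]*"([^"]*)"(.*)$'
  [[ $line =~ $tagRe ]] || return 1
  tag=${BASH_REMATCH[1]//-/_}
  rest=${BASH_REMATCH[2]}
  # Peel off one name="value" pair at a time until none are left.
  while [[ $rest =~ $attrRe ]]; do
    printf -v "${prefix}_${tag}_${BASH_REMATCH[1]//-/_}" '%s' "${BASH_REMATCH[2]}"
    rest=${BASH_REMATCH[3]}
  done
}

parse_open_tag_attrs status '<test arg1="1">'
parse_open_tag_attrs status '<test1 arg2="2">abc</test1>'
echo "$status_test_arg1 | $status_test1_arg2"
```

Calling it for every line of the formatted XML would cover parent elements as well as leaf elements.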

11 comments

  1. Hi Partik,

    This script is wonderful! It saved me a lot of time. But I have an issue here.

    Is it possible to loop through the records if there are multiple sets of records?

    a
    p
    x

    123
    456
    789

    Such that when I run, I get the following:
    parseXML.sh test.xml
    a | p | x
    123 | 456 | 789

    Cheers,
    Jfk

    1. Sorry, not sure how to enter code in here. Below is the same XML with <> replaced by [].

      [?xml version="1.0"?]
      [status]
      [xyz arg1="1"] a [/xyz]
      [abc arg2="2"] p [/abc]
      [pqr arg3="3" arg4="a phrase"] x [/pqr]
      [/status]
      [status]
      [xyz arg1="1"] 123 [/xyz]
      [abc arg2="2"] 456 [/abc]
      [pqr arg3="3" arg4="a phrase"] 789 [/pqr]
      [/status]

  2. I want to parse an XML file like the one below and store the location and filename in variables. How can I do it?

    ABC.sql
    C:\Data\DDL
    This is Data Model File
    XYZ.sql
    C:\DDL
    This DDL file is to drop all the tables of Database

    ABC.zip
    C:\Reports
    The Reports zip file to be imported.
    EFG.zip
    C:\Model
    The Framework model files

  3. This script is really helpful. But when I tried to run it, it didn’t work at first because the value of the XMLLINT_INDENT variable is an “empty string” instead of a “single space”. Maybe the HTML formatting causes the single space to disappear.

    1. Hi Kenneth, XMLLINT_INDENT is supposed to be empty, because I don’t want the xml to be indented. I checked the script again and it works for me with XMLLINT_INDENT as an empty string. I remember some corner cases, where the indentation was causing an issue.

  4. Hi, I was wondering if someone can provide a script to reverse this procedure (to generate the same XML format as above from the same text format)?
