Parse Simple XML Files using Bash – Extract Name Value Pairs and Attributes

I have written a simple routine, parseXML, that parses simple XML files and extracts unique name-value pairs and their attributes. The script extracts all XML tags of the form <abc arg1="hello">xyz</abc> and dynamically creates bash variables that hold the values of the attributes as well as of the elements. This is a good solution if you don’t wish to use XPath for some simple XML files. However, you will need xmllint installed on your system to use the script. Here’s a sample script that uses the parseXML function:

#!/bin/bash
xmlFile=$1

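function parseXML() {
  # Collect the names of all closing tags, in document order. The file is
  # joined into one line and reformatted by xmllint so that grep can pick
  # out every line that ends in a closing tag.
  elemList=( $(cat "$xmlFile" | tr '\n' ' ' | XMLLINT_INDENT="" xmllint --format - | /bin/grep -e "</.*>$" | while read line; do \
    echo $line | sed -e 's/^.*<\///' | cut -d '>' -f 1; \
  done) )

  # The last closing tag belongs to the root element. Its name plus an
  # underscore is prepended to every variable name (the variable is called
  # suffix, although it is used as a prefix).
  totalNoOfTags=${#elemList[@]}; ((totalNoOfTags--))
  suffix=$(echo ${elemList[$totalNoOfTags]} | tr -d '</>')
  suffix="${suffix}_"

  for (( i = 0 ; i < ${#elemList[@]} ; i++ )); do
    elem=${elemList[$i]}
    elemLine=$(cat "$xmlFile" | tr '\n' ' ' | XMLLINT_INDENT="" xmllint --format - | /bin/grep "</$elem>")
    # Skip elements whose closing tag stands alone on its line, i.e. elements
    # that contain child elements rather than text.
    echo $elemLine | grep -e "^</[^ ]*>$" 1>/dev/null 2>&1
    if [ "0" = "$?" ]; then
      continue
    fi
    # Keep only the text between the tags, with surrounding whitespace trimmed.
    elemVal=$(echo $elemLine | tr '\011' '\040'| sed -e 's/^[ ]*//' -e 's/^<.*>\([^<].*\)<.*>$/\1/' | sed -e 's/^[ ]*//' | sed -e 's/[ ]*$//')
    # Variable name: prefix + element name, with hyphens mapped to underscores.
    xmlElem="${suffix}$(echo $elem | sed 's/-/_/g')"
    eval ${xmlElem}=`echo -ne \""${elemVal}"\"`
    # Flatten the opening tag's attributes into an alternating list:
    # name, value, name, value, ... Spaces inside values are temporarily
    # encoded as '>' so that each value stays a single array entry.
    attrList=($(cat "$xmlFile" | tr '\n' ' ' | XMLLINT_INDENT="" xmllint --format - | /bin/grep "</$elem>" | tr '\011' '\040' | sed -e 's/^[ ]*//' | cut -d '>' -f 1  | sed -e 's/^<[^ ]*//' | tr "'" '"' | tr '"' '\n'  | tr '=' '\n' | sed -e 's/^[ ]*//' | sed '/^$/d' | tr '\011' '\040' | tr ' ' '>'))
    # Walk the list in pairs: attrList[j] is the name, attrList[j+1] the value.
    for (( j = 0 ; j < ${#attrList[@]} ; j++ )); do
      attr=${attrList[$j]}
      ((j++))
      attrVal=$(echo ${attrList[$j]} | tr '>' ' ')
      attrName=`echo -ne ${xmlElem}_${attr}`
      eval ${attrName}=`echo -ne \""${attrVal}"\"`
    done
  done
}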

parseXML
echo "$status_xyz |  $status_abc |  $status_pqr" # Variables for each XML element
echo "$status_xyz_arg1 |  $status_abc_arg2 |  $status_pqr_arg3 | $status_pqr_arg4" # Variables for each XML attribute
echo ""

#All the variables that were produced by the parseXML function
set | /bin/grep -e "^$suffix"

The XML File used for the above script example is:

<?xml version="1.0"?>
<status>
  <xyz arg1="1"> a </xyz>
  <abc arg2="2"> p </abc>
  <pqr arg3="3" arg4="a phrase"> x </pqr>
</status>

The root tag, which in this case is “status”, is used as a prefix for all variables (the script stores it, with a trailing underscore, in the variable named suffix). Once the script is run with the XML file as its argument, the function dynamically creates the variables $status_xyz, $status_abc, $status_pqr, $status_xyz_arg1, $status_abc_arg2, $status_pqr_arg3 and $status_pqr_arg4.
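
The naming scheme can be tried out in isolation. The sketch below is not the parseXML function itself: it uses a hypothetical helper, set_xml_var, and bash’s printf -v rather than eval, only to show how the root tag, the element name and an optional attribute name combine into one variable name, with hyphens mapped to underscores as in the script.

```bash
#!/bin/bash
# Hypothetical helper (not part of the original script): build a variable
# name from root tag, element and optional attribute, then assign a value.
set_xml_var() {
  local root=$1 elem=$2 attr=$3 value=$4
  local name="${root}_${elem}"
  if [ -n "$attr" ]; then
    name="${name}_${attr}"
  fi
  name=${name//-/_}                 # hyphens are not valid in bash variable names
  printf -v "$name" '%s' "$value"   # assign to the variable whose name is in $name
}

set_xml_var status pqr ""   "x"
set_xml_var status pqr arg4 "a phrase"

echo "$status_pqr | $status_pqr_arg4"   # prints: x | a phrase

# Indirect expansion reads a variable whose name is only known at run time.
name=status_pqr_arg4
echo "${!name}"                         # prints: a phrase
```

Indirect expansion (${!name}) is handy when the calling script does not know the element names in advance and has to look the variables up by constructed name.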

The output when the script is run with the XML file as an argument is:

$ bash parseXML.sh test.xml
a |  p |  x
1 |  2 |  3 | a phrase

status_abc=p
status_abc_arg2=2
status_pqr=x
status_pqr_arg3=3
status_pqr_arg4='a phrase'
status_xyz=a
status_xyz_arg1=1

This script won’t work for XML files like the one below with duplicate element names.

<?xml version="1.0"?>
<status>
  <test arg1="1"> a </test>
  <test arg2="2"> p </test>
  <test arg3="3" arg4="a phrase"> x </test>
</status>
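
One way to guard against such input is to check for repeated element names before any variables are created. The following is only a sketch under assumptions: the array names is a hard-coded, hypothetical stand-in for the script’s elemList, and the check simply reports duplicates rather than handling them.

```bash
#!/bin/bash
# Stand-in for the script's elemList: closing-tag names in document order.
names=(test test test status)

# uniq -d prints one copy of each line that occurs more than once
# in its (sorted) input.
dupes=$(printf '%s\n' "${names[@]}" | sort | uniq -d)

if [ -n "$dupes" ]; then
  echo "duplicate element names: $dupes" >&2
fi
```

A script could run a check like this right after building the element list and stop with an error instead of producing misleading variables.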

This script also won’t be able to extract attributes of elements that have no text content of their own. For example, in the file below it won’t create variables corresponding to <test arg1="1">; it will only create the variables corresponding to <test1 arg2="2">abc</test1>.

<?xml version="1.0"?>
<status>
  <test arg1="1">
    <test1 arg2="2">abc</test1>
  </test>
</status>
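
The attributes of such an element sit on its opening-tag line, which the script never looks at because it only examines lines that contain a closing tag. As a rough sketch of how they could be read, the loop below pulls name="value" pairs out of a single opening tag. It assumes double-quoted attribute values that contain no double quotes and a grep that supports -o; the variable line and the status_test prefix are illustrative only.

```bash
#!/bin/bash
line='<test arg1="1" arg2="two words">'

# grep -o prints each match on its own line; every match is one name="value" pair.
while IFS= read -r pair; do
  name=${pair%%=*}      # text before the first '='
  value=${pair#*=\"}    # drop the name, the '=' and the opening quote
  value=${value%\"}     # drop the closing quote
  printf -v "status_test_${name//-/_}" '%s' "$value"
done < <(printf '%s\n' "$line" | grep -oE '[A-Za-z_][A-Za-z0-9_-]*="[^"]*"')

echo "$status_test_arg1 | $status_test_arg2"   # prints: 1 | two words
```

The loop reads from a process substitution rather than a pipe so that the variables it sets are still available after the loop ends.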

About Pratik Sinha

Linux Nerd, Socialist, Atheist, Adventuristic, Nature Lover, Geeky.

8 comments

  1. I want to parse an XML file like the one below and store the location and filename in variables. How can I do it?

    ABC.sql
    C:\Data\DDL
    This is Data Model File
    XYZ.sql
    C:\DDL
    This DDL file is to drop all the tables of Database

    ABC.zip
    C:\Reports
    The Reports zip file to be imported.
    EFG.zip
    C:\Model
    The Framework model files

  2. This script is really helpful. But when I tried to run it, it didn’t work at first because the value of the XMLLINT_INDENT variable is an “empty string” instead of a “single space”. Maybe the HTML formatting causes the single space to disappear.

    1. Hi Kenneth, XMLLINT_INDENT is supposed to be empty, because I don’t want the xml to be indented. I checked the script again and it works for me with XMLLINT_INDENT as an empty string. I remember some corner cases, where the indentation was causing an issue.

  3. Hi, I was wondering if someone can provide a script to reverse this procedure (to generate the same XML format as above from the same text format)?
