Parse Simple XML Files using Bash – Extract Name Value Pairs and Attributes

I have written up a simple routine parseXML to parse simple XML files to extract unique name values pairs and their attributes. The script extracts all xml tags of the format <abc arg1="hello">xyz</abc> and dynamically creates bash variables which hold values of the attributes as well as the elements. This is a good solution, if you don’t wish to use xpath for some simple xml files. However you will need xmllint installed on your system to use the script. Here’s a sample script which uses the parseXML function


function parseXML() {
  elemList=( $(cat $xmlFile | tr '\n' ' ' | XMLLINT_INDENT="" xmllint --format - | /bin/grep -e "</.*>$" | while read line; do \
    echo $line | sed -e 's/^.*<\///' | cut -d '>' -f 1; \
  done) )

  totalNoOfTags=${#elemList[@]}; ((totalNoOfTags--))
  suffix=$(echo ${elemList[$totalNoOfTags]} | tr -d '</>')

  for (( i = 0 ; i < ${#elemList[@]} ; i++ )); do
    elemLine=$(cat $xmlFile | tr '\n' ' ' | XMLLINT_INDENT="" xmllint --format - | /bin/grep "</$elem>")
    echo $elemLine | grep -e "^</[^ ]*>$" 1>/dev/null 2>&1
    if [ "0" = "$?" ]; then
    elemVal=$(echo $elemLine | tr '\011' '\040'| sed -e 's/^[ ]*//' -e 's/^<.*>\([^<].*\)<.*>$/\1/' | sed -e 's/^[ ]*//' | sed -e 's/[ ]*$//')
    xmlElem="${suffix}$(echo $elem | sed 's/-/_/g')"
    eval ${xmlElem}=`echo -ne \""${elemVal}"\"`
    attrList=($(cat $xmlFile | tr '\n' ' ' | XMLLINT_INDENT="" xmllint --format - | /bin/grep "</$elem>" | tr '\011' '\040' | sed -e 's/^[ ]*//' | cut -d '>' -f 1  | sed -e 's/^<[^ ]*//' | tr "'" '"' | tr '"' '\n'  | tr '=' '\n' | sed -e 's/^[ ]*//' | sed '/^$/d' | tr '\011' '\040' | tr ' ' '>'))
    for (( j = 0 ; j < ${#attrList[@]} ; j++ )); do
      attrVal=$(echo ${attrList[$j]} | tr '>' ' ')
      attrName=`echo -ne ${xmlElem}_${attr}`
      eval ${attrName}=`echo -ne \""${attrVal}"\"`

echo "$status_xyz |  $status_abc |  $status_pqr" #Variables for each  XML ELement
echo "$status_xyz_arg1 |  $status_abc_arg2 |  $status_pqr_arg3 | $status_pqr_arg4" #Variables for each XML Attribute
echo ""

#All the variables that were produced by the parseXML function
set | /bin/grep -e "^$suffix"

The XML File used for the above script example is:

<?xml version="1.0"?>
  <xyz arg1="1"> a </xyz>
  <abc arg2="2"> p </abc>
  <pqr arg3="3" arg4="a phrase"> x </pqr>

The root tag, which in this case is “status”, is used as a suffix for all variables. Once the XML file is passed to the function, it dynamically creates the variables $status_xyz, $status_abc, $status_pqr, $status_xyz_arg1, $status_abc_arg2, $status_pqr_arg3 and $status_pqr_arg4.

The output when the script is ran with the xml file as an argument is

@$ bash test.xml 
a |  p |  x
1 |  2 |  3 | a phrase

status_pqr_arg4='a phrase'

This script won’t work for XML files like the one below with duplicate element names.

<?xml version="1.0"?>
  <test arg1="1"> a </test>
  <test arg2="2"> p </test>
  <test arg3="3" arg4="a phrase"> x </test>

This script also won’t be able to extract attributes of elements without any CDATA. For eg, the script won’t be able to create variables corresponding to <test arg1="1">. It will only create the variables corresponding to <test1 arg2="2">abc</test1>.

<?xml version="1.0"?>
  <test arg1="1">
    <test1 arg2="2">abc</test1>

About Pratik Sinha

Linux Nerd, Socialist, Atheist, Adventuristic, Nature Lover, Geeky.


  1. I want to parse a xml as below, and store the location and filename in variables. How can I do it?

    This is Data Model File
    This DDL file is to drop all the tables of Database
    The Reports zip file to be imported.
    The Framework model files

  2. this script is really helpful. But when I try to run it, the script didn’t work at first because of the XMLLINT_INDENT variable value is “empty string” instead of “single space”. Maybe the HTML format causes the single space to disappear.

    1. Hi Kenneth, XMLLINT_INDENT is supposed to be empty, because I don’t want the xml to be indented. I checked the script again and it works for me with XMLLINT_INDENT as an empty string. I remember some corner cases, where the indentation was causing an issue.

  3. Hi I was wondering if some can provide script to reverse this procedure(to generate same xml format like above from same text format)?

Leave a Reply