Jacks_Depression

Jacks_Python-libxml2

Posted: 2009-08-24 16:42:59

Beware the Python bindings for libxml2. I was using them in a server application and the thing kept crashing. Memory use would grow until the process started throwing memory errors. Here is the test case I put together as proof. If you see something wrong, let me know.

import time, random, sys

import libxml2

alphabit = 'abcdefghijklmnopqrstuvwxyz'
rand = random.Random()

def main():
    while True:
        randXml = makeRandomXml()
        resultChar = parseForResult(randXml)
        sys.stdout.write(resultChar)
        sys.stdout.flush()
        try:
            time.sleep(0.1)
        except KeyboardInterrupt:
            print('-')
            break

def parseForResult(xmlStr):
    rootElement = libxml2.parseDoc(xmlStr)
    childKey = getText(rootElement, '*/@key')
    finalChar = getText(rootElement, '*/%s' % childKey)
    return finalChar

def makeRandomXml():
    child = randWord(8)
    other = randWord(10)
    result = randChar()
    return '<randomXml key="%s"><%s/><%s>%s</%s></randomXml>' % (child, other, child, result, child)

def randWord(length):
    output = ''
    for x in range(length):
        output = output + randChar()
    return output

def randChar():
    return rand.choice(alphabit)

def getText(element, expression=None):
    if expression:
        returnVal = None
        node = element.xpathEval2(expression)
        if node:
            returnVal = node[0].content
        del node
        return returnVal
    else:
        return element.content

if __name__ == "__main__":
    main()

What the code does is simple. It creates an XML string whose structure never changes but whose tag names are random. For example:

<randomXml key="aichjaxu"><ryvbjcoxbs/><aichjaxu>r</aichjaxu></randomXml>

This just gives the parser something to work on. The document is parsed, then one XPath search reads the key attribute and a second one fetches the element named by that key. This is similar to my real-world case.

I used the 'top' command to watch the memory usage and noticed it growing erratically, at about 4 bytes a second. For something that will run for months or years without a restart, this is a real problem. Of course, this does not prove that the problem is with libxml2. So, let's replace the parseForResult function.

def parseForResult(xmlStr):
    finalChar = xmlStr[49]
    return finalChar

Run it again and observe for a while. I did not see it grow by a single byte the whole time I watched it.
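Watching 'top' by eye makes slow growth easy to misjudge, so here is a minimal sketch of having the process report its own memory instead. It is not part of the original test: the helper names peak_rss_kb and watch are made up for illustration, it assumes a Unix system (the resource module is not available on Windows), and it reports peak resident size, so it can show growth but never shrinkage.

```python
import resource
import sys


def peak_rss_kb():
    """Return this process's peak resident set size in KiB.

    ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        peak //= 1024
    return peak


def watch(work, iterations=1000, report_every=100):
    """Call work() repeatedly and collect a peak-RSS sample every few calls."""
    samples = []
    for i in range(1, iterations + 1):
        work()
        if i % report_every == 0:
            samples.append(peak_rss_kb())
    return samples
```

Passing something like lambda: parseForResult(makeRandomXml()) as work and comparing the first and last samples gives a number to compare between the variants, rather than an impression.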

For good measure, let's make sure all the bases are covered and explicitly delete these variables.

def parseForResult(xmlStr):
    rootElement = libxml2.parseDoc(xmlStr)
    childKey = getText(rootElement, '*/@key')
    finalChar = getText(rootElement, '*/%s' % childKey)
    del rootElement
    return finalChar


def getText(element, expression=None):
    if expression:
        returnVal = None
        node = element.xpathEval2(expression)
        if node:
            returnVal = node[0].content
        del node
        return returnVal
    else:
        return element.content

It looks to me like the memory size grows a little more slowly, but it is still growing nonetheless.
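One thing none of the variants above does is call freeDoc(). The libxml2 binding documentation says a parsed document has to be released explicitly with doc.freeDoc(), and an XPath context with xpathFreeContext(); a Python 'del' only drops the wrapper object. The sketch below is an untested guess at what that would look like here, not a result from this post. The name parse_for_result is made up, the paths are written as absolute ('/*/...') so they do not depend on the context node, and whether this removes the growth seen above is an assumption that would need to be measured.

```python
try:
    import libxml2  # third-party binding; may not be installed
except ImportError:
    libxml2 = None


def parse_for_result(xml_str):
    """Variant of parseForResult that explicitly frees libxml2's C-side memory."""
    doc = libxml2.parseDoc(xml_str)
    try:
        ctxt = doc.xpathNewContext()
        try:
            # Read the key attribute of the root element.
            keys = ctxt.xpathEval('/*/@key')
            if not keys:
                return None
            # Fetch the child element named by that key.
            hits = ctxt.xpathEval('/*/%s' % keys[0].content)
            return hits[0].content if hits else None
        finally:
            ctxt.xpathFreeContext()  # release the XPath context
    finally:
        doc.freeDoc()  # release the parsed document tree
```

The try/finally blocks are there so the frees still happen if an XPath expression raises.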

And this is what I did to fix the problem...

from xml.etree import ElementTree

def parseForResult(xmlStr):
    rootElement = ElementTree.fromstring(xmlStr)
    childKey = 'None'
    if 'key' in rootElement.attrib:
        childKey = rootElement.attrib['key']
    focusNode = rootElement.find('./%s' % childKey)
    finalChar = 'X'
    if focusNode is not None:
        finalChar = focusNode.text
    return finalChar

python-lxml seems to be better when it comes to memory, but it still leaks. I had heard of ElementTree before the libxml2 bindings, but I did not use it because of its poor XPath support. It would seem there is not a single native Python library with full XPath support.
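To give an idea of what that limited support looks like in practice, here is a small sketch using only the standard library. The function name summarize and the dictionary keys are made up for illustration. ElementTree's find, findall and findtext accept simple path expressions that select elements, while attribute values such as the key have to be read through the element API rather than with an '@key' path.

```python
from xml.etree import ElementTree

SAMPLE = '<randomXml key="aichjaxu"><ryvbjcoxbs/><aichjaxu>r</aichjaxu></randomXml>'


def summarize(xml_str):
    """Show the kind of lookups ElementTree paths can do on the test document."""
    root = ElementTree.fromstring(xml_str)
    # Attribute values come from the API; paths only ever return elements.
    key = root.get('key')
    return {
        # './*' selects every direct child element.
        'children': [el.tag for el in root.findall('./*')],
        # './<tag>' selects a direct child by name; findtext returns its text.
        'keyed_text': root.findtext('./%s' % key),
    }


print(summarize(SAMPLE))
```

That is enough for a fixed layout like this one, but it is a long way from the expressions a full XPath engine accepts.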