Parsing XML – Web Development


The next thing we’re going to learn is how to parse XML. Now, I’m not going to make you write an actual parser. I think there’s probably a whole class at Udacity on learning to do almost exactly that. What I’m going to show you is how to use the built-in parser in Python. Python has a library called “minidom,” and you can get it by saying something like this: from xml.dom import minidom.
One thing I’d like to point out real quick is that when you’re working with XML you’ll often see the word “dom,” as in xml.dom. It stands for “document object model.” This basically refers to the internal representation of an XML document. In Python you would have an object that has a list of children, and each of these children is some sort of tag object, and a tag object may have a name, attributes, contents, and that sort of thing. Any time you’re dealing with XML programmatically, you’ll see references to a dom, or if you’re working in your browser, you’ll see references to “the dom,” which refers to the document, the HTML, that you’re manipulating programmatically.
In this particular case we’re going to use minidom. Why is it called minidom and not something else? Well, “mini” implies that this is a smaller, lightweight version of a dom parser. Parsing XML is actually a really complicated thing, because XML documents can sometimes be many gigabytes large, and parsing all of that text is nontrivial. But when you’re only parsing a little bit of text, you can use minidom, which is simple and fast. It builds the whole document in memory, so it will break if you throw gigabytes of text at it, but for our purposes it will work just great.
I’m not going to quiz you on this sort of stuff. Just carry it with you: dom refers to the computer representation of the XML, and minidom is a handy library for manipulating it in Python. Now I will show you how to use it. Here we are in Python.
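As a rough sketch of what “an object that has a list of children” looks like in practice, the snippet below imports minidom and walks a tiny document. The XML string and its tag names (note, to, body) are made up for illustration and are not part of the lecture, and the snippet assumes Python 3.

```python
from xml.dom import minidom

# Hypothetical XML, invented for this sketch (not from the lecture).
doc = minidom.parseString('<note priority="high"><to>Ann</to><body>Hi</body></note>')

root = doc.documentElement             # the top-level <note> element
print(root.tagName)                    # the tag's name
print(root.getAttribute("priority"))   # the value of one of its attributes

# Each child is itself an element object with its own name and contents.
for child in root.childNodes:
    print(child.tagName, child.firstChild.nodeValue)
```

There is no whitespace between the tags in this string, so every child of the root is an element. In indented XML, the whitespace between tags also shows up as text nodes.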
I’m going to give you a little demo of minidom before you start using it on your own: from xml.dom import minidom. Now we have our minidom. Minidom has a function called “parseString,” which parses a string of XML. Let’s go ahead and give that a whirl. I’ve typed up some example XML. We have an opening <mytag>, some text, and an opening <children> tag. Remember, these tag names I’m just making up. HTML has specific tags that you need to use. In XML you can have whatever arbitrary tags you want. It’s up to the people reading and writing the XML to agree on the tag names. I created some items, item 1 and item 2. I closed my <children> tag and I closed my <mytag>.
Now, when I was typing this, I made a little typo, and I’m curious to see what happens. Let’s go ahead and run this with the typo and see. Oh, boy! [chuckles] We get an error: a mismatched tag. That makes sense. We have an opening “chilrdren” and a closing “children.” Let’s make this proper. Okay, we’re going to run this without the typo, and I’m going to store the result in a variable so I have access to it. I’ll call it x. All right. This time no exception.
If we take a peek at x, we can see we have a minidom Document instance. Let’s take a peek at what we have on x. Holy smokes! Look at all this stuff. There are a lot of interesting things here in x. There’s appendChild, functions for manipulating the document, functions for creating nodes, and stuff like that. There are some lookup functions, which we’re going to be using later: getElementById and getElementsByTagName. The NS suffix, as in getElementsByTagNameNS, refers to a namespace. All sorts of stuff: parentNode, some output functions. toprettyxml is an interesting one, so let’s play with it. This is one I use all the time. If we take our document object and call toprettyxml on it, the result doesn’t look very pretty at all, does it?
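The demo so far can be sketched roughly as follows. The XML string is a reconstruction from the description (the tag names mytag, children, and item come from the lecture, but the literal text content “some text” is an assumption), and the code assumes Python 3.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Reconstructed demo input; "some text" is a placeholder for the actual text.
bad_xml = "<mytag>some text<chilrdren><item>1</item><item>2</item></children></mytag>"
good_xml = "<mytag>some text<children><item>1</item><item>2</item></children></mytag>"

# The misspelled opening tag does not match the closing tag, so parsing fails.
error_message = None
try:
    minidom.parseString(bad_xml)
except ExpatError as err:
    error_message = str(err)
print("parse failed:", error_message)

# With the typo fixed, parseString returns a Document object.
x = minidom.parseString(good_xml)
print(x)

# dir(x) lists everything available; here we filter to the lookup functions.
lookup_names = [name for name in dir(x) if name.startswith("getElement")]
print(lookup_names)

# toprettyxml returns a string; its repr shows the escape characters,
# while printing it shows the indented structure.
pretty = x.toprettyxml()
print(repr(pretty))
print(pretty)
```

minidom reports malformed input by raising ExpatError from the underlying expat parser, which is why the sketch catches that exception type.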
Let’s print that, because what we’re seeing is the Python string with the newline characters in it. If we actually print it, it looks a lot prettier. Here is the XML that I entered, and you can see the structure of the document. It indents it nicely for us. That’s a handy little function. When you download XML from somewhere, you can see its structure a little more clearly with toprettyxml.
Okay, there’s a function I’d like to show you here: getElementsByTagName. If I run this function on our x object and give it mytag, it returns one dom element. If I run it on item, we get two dom elements. Looking at the first element called “item,” we can see that we have an item. To look at its children, we can use childNodes to see a list of children. We can see that we have one text node. If we look at the first one of those, we can access the nodeValue attribute and see that it’s 1.
Now, remember the pretty-printed version of our XML. What we just did was say: get me all of the elements that are called item. Here is 1, and here is 2. On the first one, get me its first child, which is the text inside the tag. That isn’t strictly an element, but in minidom it’s represented as a text node, which is basically just the text content. Different libraries may handle contents differently, but in minidom this is how we get it. Then we can say: get the value of that text node. That’s how we got the 1. The u in front means it’s a unicode string. Minidom assumed that we were entering a unicode string, which is fine.
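Here is a minimal sketch of that lookup, again using a reconstructed version of the demo XML (the text content “some text” is an assumption). It assumes Python 3, where strings are unicode by default, so the value displays as '1' with no u prefix. The u'1' form is what Python 2 shows.

```python
from xml.dom import minidom

# Reconstructed demo input; "some text" is a placeholder for the actual text.
x = minidom.parseString(
    "<mytag>some text<children><item>1</item><item>2</item></children></mytag>"
)

print(len(x.getElementsByTagName("mytag")))  # 1

items = x.getElementsByTagName("item")
print(len(items))                            # 2

first = items[0]
print(first.childNodes)                      # a list holding one text node

text_node = first.childNodes[0]
print(text_node.nodeValue)                   # 1

# Collect the text of every <item> in document order.
values = [item.firstChild.nodeValue for item in items]
print(values)                                # ['1', '2']
```

Note that nodeValue gives back a string, so the items would need int() if they are meant to be used as numbers.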
