Learn XQuery and XPath: start programming with NUX+XOM and TagSoup Java libraries


Feb 24 2007, 18:57

I do not blog here very often, but this time I will make an exception.

If you are not a Java programmer or web developer, and have no intention of being one - you can stop reading this post right here.

If you are or do want to be one of these kinds of people, read on.

Last month, amazingly, the W3C (World Wide Web Consortium) ended its nearly half-decade-long tradition of not approving the XQuery 1.0 specification.

They actually approved it and it became an official W3C Recommendation (read: "standard") in January 2007.

XQuery is very nice, bordering on fantastic, for slicing/dicing XML data or using it to create web pages.

With the addition of a good HTML-cleansing XML parser, like TagSoup, you can also easily "scrape" information from web pages using XQuery.

This week I rediscovered NUX, a US DOE-sponsored open source Java library for running XQuery scripts. NUX builds on XOM.
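To give a feel for it, here is a minimal sketch of running a query through NUX's XQueryUtil helper, which hands the results back as XOM nodes. It assumes the NUX, Saxon, and XOM jars are on the classpath. The class name NuxSketch, the sample XML, and the query are all made up for illustration.

```java
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nux.xom.xquery.XQueryUtil;

public class NuxSketch {
    // Hypothetical sample data, not taken from any real feed.
    static final String XML =
        "<items>"
      + "<item qty=\"3\"><name>pear</name></item>"
      + "<item qty=\"1\"><name>fig</name></item>"
      + "<item qty=\"5\"><name>apple</name></item>"
      + "</items>";

    // Hypothetical query: names of items with qty above 1, sorted by name.
    static final String QUERY =
        "for $i in /items/item "
      + "where number($i/@qty) > 1 "
      + "order by string($i/name) "
      + "return $i/name";

    // Parse the XML with XOM, then evaluate the query against the document.
    static Nodes run(String xml, String query) throws Exception {
        Document doc = new Builder().build(xml, null);
        return XQueryUtil.xquery(doc, query);
    }

    public static void main(String[] args) throws Exception {
        Nodes names = run(XML, QUERY);
        for (int i = 0; i < names.size(); i++) {
            System.out.println(names.get(i).getValue());
        }
    }
}
```

The filtering and sorting live entirely in the query string, so the Java side is little more than "parse, run, print".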

XOM is a very nice open source XML programming framework.

XOM makes it easy to parse XML, perform XSLT transformations upon it, search it using XPath expressions, modify it, and write it out again in pretty-printed or regular forms.
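As a rough illustration of that workflow, here is a small sketch that parses a string, searches it with XPath, adds an element, and pretty-prints the result. It assumes the XOM jar is on the classpath. The class name XomSketch and the sample document are my own inventions, not anything from the XOM distribution.

```java
import java.io.ByteArrayOutputStream;
import nu.xom.Attribute;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Nodes;
import nu.xom.Serializer;

public class XomSketch {
    // Hypothetical sample data.
    static final String XML =
        "<notes><note id=\"1\">first</note><note id=\"2\">second</note></notes>";

    // Parse XML held in a string; the second argument is the base URI (none here).
    static Document parse(String xml) throws Exception {
        return new Builder().build(xml, null);
    }

    // Search with an XPath expression.
    static Nodes allNotes(Document doc) {
        return doc.query("/notes/note");
    }

    // Modify the tree: append one more <note> element to the root.
    static void addNote(Document doc, String id, String text) {
        Element note = new Element("note");
        note.addAttribute(new Attribute("id", id));
        note.appendChild(text);
        doc.getRootElement().appendChild(note);
    }

    // Write the document out again, indented by two spaces per level.
    static String prettyPrint(Document doc) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Serializer serializer = new Serializer(out, "UTF-8");
        serializer.setIndent(2);
        serializer.write(doc);
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        addNote(doc, "3", "third");
        System.out.println(allNotes(doc).size() + " notes");
        System.out.print(prettyPrint(doc));
    }
}
```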

If you are starting off with HTML and not XML data, XOM only requires half a line of code - just the fully qualified name of an HTML-tolerant XML parser - to be added to your Java program.
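Here is roughly what that looks like with TagSoup, as a sketch rather than a recipe. It assumes the TagSoup and XOM jars are on the classpath. The class name TagSoupSketch and the sloppy HTML snippet are made up. One thing to watch for: TagSoup puts the elements it emits into the XHTML namespace by default, so the XPath expression needs a namespace prefix.

```java
import java.io.StringReader;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.XPathContext;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class TagSoupSketch {
    // Hypothetical messy HTML: unclosed <p> tags and an unquoted attribute.
    static final String HTML =
        "<html><body><p>First <a href=\"/one\">one</a>"
      + "<p>Second <a href=/two>two</a></body>";

    // The only HTML-specific part is the parser's fully qualified class name.
    static Document parseHtml(String html) throws Exception {
        XMLReader tagsoup =
            XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
        return new Builder(tagsoup).build(new StringReader(html));
    }

    // Collect every link target; "h" is bound to the XHTML namespace.
    static Nodes linkTargets(Document doc) {
        XPathContext xhtml = new XPathContext("h", "http://www.w3.org/1999/xhtml");
        return doc.query("//h:a/@href", xhtml);
    }

    public static void main(String[] args) throws Exception {
        Nodes hrefs = linkTargets(parseHtml(HTML));
        for (int i = 0; i < hrefs.size(); i++) {
            System.out.println(hrefs.get(i).getValue());
        }
    }
}
```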

XOM was written in Java by Elliotte Rusty Harold, a well-published XML developer and author. Rusty writes and programs simply. That is, he takes complicated ideas and problems, and expresses his solutions to them in a way that is easy for a fellow Java programmer or XML developer to understand - and use.

Rusty's XOM framework is nothing short of amazing. If you are dealing with XML - parsing, building, searching, or pretty-printing it - and you are a Java programmer, you must download it and start playing around with it.

It is so simple to use, it will not take you any time at all to get started. It is safe to say that no matter which XML framework you have used to date - XOM is easier.

If you are a Java programmer, I hope I have encouraged you to take a look at the XQuery scripting language and these XML libraries for Java programmers that I mentioned.

This stuff is really good, really simple to use, and very powerful.

I have a couple of books on XQuery. However, the last time I read them was probably two or three years ago. I had forgotten most of what I had learned about XQuery.

I had also fiddled around with using XQuery to grab a little data from web pages on my Mac, using Apple's free Sherlock plugin for developers. The last time I messed with that was also a couple of years or more ago.

In the course of one morning this week, spent actually writing some XQuery scripts to scrape a lot of data from HTML pages, I got to be pretty decent at using XQuery.

I discovered I could manipulate HTML and XML easily, using a fraction of the lines of code that it would take to do the same thing in Java. I also discovered the finished script was far easier to read and understand than the equivalent program would have been in Java or another general purpose programming language.
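To make the size difference concrete, here is a small comparison that needs nothing beyond the JDK. The JDK does not ship an XQuery engine, so this sketch swaps in a plain XPath expression as a stand-in for a query and sets it against hand-written DOM loops that collect the same values. The class name QueryVersusLoop and the sample orders are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class QueryVersusLoop {
    // Hypothetical sample data.
    static final String XML =
        "<orders>"
      + "<order status=\"open\"><id>17</id></order>"
      + "<order status=\"closed\"><id>18</id></order>"
      + "<order status=\"open\"><id>19</id></order>"
      + "</orders>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Query style: the whole selection is one expression.
    static List<String> withXPath(Document doc) throws Exception {
        NodeList ids = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("/orders/order[@status='open']/id", doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < ids.getLength(); i++) {
            out.add(ids.item(i).getTextContent());
        }
        return out;
    }

    // Loop style: walk the tree by hand and test each node.
    static List<String> withLoops(Document doc) {
        List<String> out = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (!(children.item(i) instanceof Element)) continue;
            Element order = (Element) children.item(i);
            if (!"order".equals(order.getTagName())) continue;
            if (!"open".equals(order.getAttribute("status"))) continue;
            // Matches descendants named "id"; in this flat sample those are the children.
            NodeList ids = order.getElementsByTagName("id");
            for (int j = 0; j < ids.getLength(); j++) {
                out.add(ids.item(j).getTextContent());
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(withXPath(doc));
        System.out.println(withLoops(doc));
    }
}
```

Both methods return the same list for this sample. The gap widens once sorting, grouping, or building new markup enters the picture, which is where a full XQuery script earns its keep.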

So grab your NUX, XOM, and TagSoup libraries and brush up on your XQuery skills and then leap in - the surf is up again!

What are you waiting for?

