CompLearn: Artificial Intelligence the Easy Way

CompLearn: Artificial Intelligence the Easy Way by Rudi Cilibrasi

I've just spent the last 3 years studying for my PhD at the Centrum voor Wiskunde en Informatica in Amsterdam, the same place Stephanie Wehner and Sandor Heman work. Although I spend most of my time on data compression research, right now I am decompressing after the most mind-blowing conference I have ever attended in Europe. What the Hack was a real eye-opener for me. In most large gatherings I've been to in the past, the number of people present seemed to vary inversely with the caliber of the average attendee. I have been to some interesting meetings of just 5 or 10 people, and I have seen the odd 2600 meeting of 20 or 30 people which went off well. But in meetings (or companies, for that matter) of more than 100-200 people I have rarely been pleased; usually they are overrun with fraternity boys drinking towards incoherence and increasing the volume until everybody is shouting over everybody else. This would not be so terrible if they had something interesting to say, but in my experience it had always been dull. I had thought this was a universal principle until I came to What the Hack in 2005.

WTH amazed me in so many ways. The most incredible was the fact that more than 90% of the people present were deeply creative individuals. It shocked me again and again to overhear the conversations going on around me, and with me, going into ever deeper levels of technical detail, apparently with no end in sight. The presentations held nothing back, such as the Pocket-PC shell code snippet in hand-coded assembly or the mesmerizing Bump-Key double demo. This was engineering honed to its finest, combined with a natural and organic sense of community and overflowing generosity. Everywhere I went it was as if this world was a place apart from and above the normal world. I had no idea it was possible to get so many talented and creative people together in one place.

And the people were not just clever; they were also generous with their knowledge, willing to share and explain in whatever ways necessary to get the point across. In my short time at WTH I learned something about WiFi as well as the proper way to pitch my tent. I also had something to offer: a new version of my main open source software project, called CompLearn, at http://complearn.org/

CompLearn is the first open-source drag-and-drop data mining system. It has a 3D graphics animation system, based on the Simple DirectMedia Layer (SDL) library, to display the results of its analysis visually. It accepts any files that are compressible, because it makes no extra assumptions or restrictions on the type or format of the incoming data. The core of the algorithm is a very special function called the Normalized Compression Distance (NCD). This formula takes two parameters in the form of files (or finite binary strings, if you prefer the information-theoretic terminology) and returns just one scalar result, usually a number between 0 and 1. The wonderful thing about NCD is that it can effectively pick up on a wide (indeed, in principle, universal) range of statistical properties. Unlike neural networks and most other machine learning techniques, it requires no adjustment or parameter tuning. The main distinction of the CompLearn system is its parameter-free, robust data mining. Because there are no required adjustments, there is no hassle in using it. There is a command-line version for Windows and Linux with a modular extension system for adding new compressors. Built in right now we have bzip2, gzip, and the so-called "Google" compressor to calculate the Normalized Google Distance. Together, these simple beginnings have shown great promise in both objective and subjective reasoning systems. I have also packaged the system using autoconf, automake, and libtool, so you may download the C99 library and begin using it immediately. I have followed a simple object-oriented style in my interfaces to allow for easy wrapping of bindings in many different languages, and I already have a version working in Ruby. CompLearn uses SOAP to connect to the Google database and allow for simple semantic reasoning.
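
To make the idea concrete, here is a minimal Python sketch of the NCD formula as it is usually stated: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the size in bytes of s after compression. This is only an illustration under stated assumptions: it uses Python's standard bz2 module as the compressor, it is not the CompLearn library's own interface, and the function names compressed_size and ncd are invented for this example.

```python
import bz2


def compressed_size(data: bytes) -> int:
    """Return C(data): the length in bytes of the bzip2-compressed input."""
    return len(bz2.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Hypothetical helper for illustration; not part of CompLearn's API.
    Values near 0 suggest the inputs share a lot of structure,
    values near 1 suggest they share very little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    english = b"the quick brown fox jumps over the lazy dog " * 50
    similar = b"the quick brown fox leaps over the lazy dog " * 50
    numbers = b" ".join(str(i * i).encode() for i in range(600))
    print("english vs similar:", ncd(english, similar))
    print("english vs numbers:", ncd(english, numbers))
```

Because real compressors are imperfect, the result can land slightly outside the range 0 to 1, which is why the text above says "usually". To compare two files, read each one in binary mode and pass the bytes to ncd.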

I myself have used this system to automatically determine evolutionary trees from genetic sequence data on animals, viruses, fungi, and fish. Stephanie has used it to analyze computer viruses. I have used it to analyze music and separate it by genre (classical, rock, or jazz) or by composer. I have fed in its own source code, and it returned an organization of itself that made some sense and that I still use to help me figure out what needs improvement. The best thing about this system is that you don't have to be a math whiz to use it. I had a hard time learning differential equations and calculus, and so I usually try to avoid this type of math whenever possible. When using the CompLearn system, you never have to figure out an integral as you do in normal statistics. All you have to do, at the most, is perhaps write a simple translation script to convert your data from its original format into something that will compress better and more meaningfully than before. So, for example, you might take webpages, strip out the HTML tags, convert all the remaining text to lowercase, and make the spacing consistent. This would then let you ignore the details of the HTML design and instead analyze just word choice. Or perhaps you have files in an old ASCII floating-point format where numbers are written out over many bytes in decimal. These might be converted into a simple fixed-precision quantized format to allow the gzip or bzip2 compressor to find meaningful substring matches easily. Let me assure you, for people who enjoy programming (and especially the Python, Perl, and Ruby crowd), this is much easier and faster than working out a lot of hard math. And the best part is that it is often more robust.
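
The two translation scripts described above might look something like the following Python sketch. Everything here is an assumption made for illustration: the function names normalize_html and quantize_floats are invented, the regular-expression tag stripping is deliberately crude (a real HTML parser would handle scripts, comments, and entities better), and the choice of two decimal places is arbitrary.

```python
import re

# Crude patterns, good enough for a sketch; not a full HTML or number parser.
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")
FLOAT_RE = re.compile(r"[-+]?\d+\.\d+(?:[eE][-+]?\d+)?")


def normalize_html(page: str) -> str:
    """Strip tags, lowercase the text, and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", page)  # replace each tag with a space
    text = text.lower()
    return WHITESPACE_RE.sub(" ", text).strip()


def quantize_floats(text: str, digits: int = 2) -> str:
    """Rewrite every decimal number in the text with a fixed precision."""
    def fix(match: re.Match) -> str:
        return f"{float(match.group()):.{digits}f}"
    return FLOAT_RE.sub(fix, text)


if __name__ == "__main__":
    print(normalize_html("<p>Hello   <b>World</b></p>\n"))
    print(quantize_floats("3.14159265 2.71828183"))
```

The output of either function can then be written to a file and handed to the compressor-based analysis in place of the raw data. Quantizing makes nearby measurements share the same byte sequences, which is what lets gzip or bzip2 find the substring matches mentioned above.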

I had so many good conversations at What the Hack, and heard so many great ideas, that I cannot wait to see where it all leads. I am going to be checking the archives. Some people have already thought of wonderful new applications for my software, and I am excited to hear through email or VOIP how the developments proceed. If you are interested in talking with me more about CompLearn or anything else, please do not hesitate to send me an email at cilibrar AT cilibrar DOT com. Several people have already emailed me, and in fact I have connected one of them with a researcher friend of mine who is working on the exact idea that he himself had: .wav/.mp3 to .midi file conversion to allow for music analysis of real recordings. You can read a lot of papers about this system at http://cilibrar.com/ if you enjoy math. If not, you can just enjoy the tree pictures and the 3D demo, which are interesting by themselves and as exploratory toys.

http://complearn.org/

To my new friends, congratulations on a wonderful conference! Hack on!!!

P.S. To "Felix": I have borrowed your neato-blue flashlight. Sorry, I could finally find my tent with the sunrise but was unable to find you when I woke up. Please email me to tell me how to return it to you.

Related Articles: Using Data Compressors for Robust Reasoning