Computational Knowledge Discovery and Learning in Complex Systems


Simon Kasif
University of Illinois at Chicago &
Johns Hopkins University, Baltimore


The goal of scientific research is to understand the world around us in general, and the behavior of complex systems and processes (physical, biological, social, organizational and computational) in particular.

This understanding allows us to create new methods to monitor and manipulate complex systems, invent new processes, and subsequently create numerous opportunities for improving the quality of human life and endeavor.

The scientific method typically involves two essential steps:


Model Formation:


Model Verification:

While this approach to science had (and still has) enormous successes, largely due to the remarkable human intellect behind it, the scientific method sketched above was, before computers, necessarily limited in scope. The main cause was a severe lack of computational resources, infrastructure, and tools supporting model formation and verification.

In particular:

  1. The models had to be concise, e.g., relatively simple (in description complexity) mathematical equations in physics, or a small set of definitions or cases in medicine.
  2. The models had to be supported by very simple methods for testing their validity (e.g., statistical regression).
  3. The models were invented by humans and verified by a typically non-interactive simulation or by relatively simple data analysis procedures.

It is therefore not surprising that the physical processes we understood in previous centuries are often captured by a single equation. However, many processes have substantial inherent complexity (in the sense of, e.g., Kolmogorov complexity) and are difficult to specify, understand, and analyze without computational tools. Such tools expand our ability to probe vastly more complex systems (such as the brain, biological systems, economic processes, computational finance, and organizational structures) than have been tackled in the past.

Modern computers allow us to fundamentally change our approach to the scientific process in general and scientific discovery, modeling, and verification in particular.

  1. The class of specification languages for describing complex processes has expanded substantially to include a wide arsenal of new high-level languages and formalisms such as probabilistic languages, rules, grammars, decision trees, complex probabilistic models, constraints, hierarchical representations, logic, Bayes networks, and complex modeling tools. This allows us to specify and emulate enormously complex systems, which was previously infeasible.
  2. The process of fitting models to existing data, and of gaining insight into the data through those models, can be supported by extensive capabilities to store and analyze terabytes of data (e.g., the work on speech understanding, star/galaxy classification, and DNA sequence modeling).
  3. The process of scientific discovery, and model formation in particular, can now be supported by new data mining tools that allow adaptive formation of different models and their verification (e.g., in classification of astronomical objects or gene location in DNA sequences).
  4. The theoretical underpinnings of computer science allow us to develop deeper insights into physical or economic processes that involve computational considerations (e.g., bounded rationality in economics and computational complexity in biological processes).
  5. Interaction with scientific data and models can now be supported by a collection of new adaptive tools that allow active learning, visualization, and more direct manipulation of the process.

The new "Computational Method" for discovery in science and engineering includes:

  1. new adaptive representation tools for "world" modeling,
  2. novel algorithms for inference and manipulation of adaptive representations,
  3. advanced computational infrastructures,
  4. adaptive search and retrieval mechanisms for massive databases,
  5. new visualization and summarization tools,
  6. new mechanisms for fusing data and representations.

The most interesting, and perhaps the most promising, aspect of this new approach is that it tends to have a profound effect on many diverse disciplines. This is perhaps not surprising, considering that the "mathematical method" also had a profound effect on a variety of disciplines.

Two particularly prominent examples illustrate this new approach to scientific discovery. The first is the use of probabilistic methods, which became feasible only due to recent computational developments in theory, algorithms, and hardware, and which now dominate a wide variety of areas such as gene finding and location, protein function understanding, speech understanding, natural language understanding, user modeling in Microsoft systems, and other domains.

The second is computational biology and bioinformatics, a field that is becoming increasingly technology-rich and is advancing at enormous speed, building on Web-based biological databases and on systems built to facilitate and aid scientific discovery.

In this talk we will describe several new-generation systems, based on intelligent systems technology, that perform gene finding, DNA sequence modeling, and, more generally, retrieval from biological databases. The systems are based on learning algorithms and adaptive probabilistic representations that became feasible only recently due to the remarkable breakthroughs in computing speed and memory capacity.
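
As a rough illustration of what a learned probabilistic representation of DNA sequences can look like, the following minimal sketch trains two first-order Markov chains and compares their log-likelihoods on a new sequence. It is not a description of the systems presented in the talk: the choice of a first-order chain, the add-one smoothing, the function names, and the toy training sequences are all assumptions made for this example.

```python
import math

ALPHABET = "ACGT"

def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current).

    A pseudocount is added to every transition so that unseen
    transitions do not receive probability zero.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model

def log_likelihood(model, seq):
    """Log-probability of the transitions in seq (the first symbol is ignored)."""
    return sum(math.log(model[cur][nxt]) for cur, nxt in zip(seq, seq[1:]))

def log_odds(model_a, model_b, seq):
    """Positive when seq is better explained by model_a than by model_b."""
    return log_likelihood(model_a, seq) - log_likelihood(model_b, seq)

# Invented toy data (hypothetical, for illustration only).
gc_rich = ["GCGCGGCGCC", "CCGCGGGCGC", "GGCGCCGCGG"]
at_rich = ["ATATTAATAT", "TTAATATATA", "AATTATATTA"]

m_gc = train_markov(gc_rich)
m_at = train_markov(at_rich)

score_gc_like = log_odds(m_gc, m_at, "GCGGCGCG")
score_at_like = log_odds(m_gc, m_at, "ATTATAAT")
print(score_gc_like > 0, score_at_like < 0)
```

The sign of the log-odds score indicates which of the two learned models explains a sequence better. Practical gene finders rely on much richer models and far more training data; the sketch only shows the basic fit-then-score pattern.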

We expect the technology and the scientific knowledge created by this research to have a major impact on scientific discovery and computational learning of complex systems in the physical sciences, engineering, business, and social science communities. This computational approach to knowledge discovery appears to enhance the opportunities for scientific breakthroughs and for learning complex systems, and to aid in probing the most challenging problems of our society.