Introduction
What's the difference between data and information? Data is a collection
of facts. Information is data that has been interpreted for use.
Unfortunately, we are all drowning in data, without the benefit
of information we can actually use. That's because data comes at
us constantly from a myriad of sources like water out of a fire
hose - a rapid, overpowering stream of numbers, facts, and figures.
Over the past decade, Teledyne Scientific Company has worked on
numerous challenging real-world data analysis problems and learned
the following important lessons:
- Preprocessing
is the most important step in data analysis. Unfortunately,
this step is ignored in today's leading data mining software
products. Why? Because preprocessing involves multiple algorithms
in multiple disciplines.
- In
most applications, over 90% of stored data is either irrelevant
or redundant. The value of data must be quantified instead
of blindly storing data in multiple data warehouses. Without
a thorough understanding of the value of information, storage
requirements can be an order of magnitude too great.
- Useful
data are stored in multiple sources in various formats. Without
integrating information from these disparate sources, it is
difficult to have an integrated view of the entire solution
space. Therefore, it is of paramount importance that multiple
metadata views be created for multiple departments so that
they can work with one voice.
- There
is too much emphasis on sophisticated learning. In general,
system errors can be divided into two types - model mismatch
and data mismatch. Model mismatch occurs when the learning
algorithm does not capture all the nuances present in the training
data, while data mismatch is the direct result of the actual
real world data being different from the training data with
which the algorithm parameters are tuned. In most real world
situations, the magnitude of data mismatch is much greater
than model mismatch. On numerous occasions, simpler learning
algorithms outperform their more sophisticated brethren when
it matters.
- Human
operators appreciate insights.
They do not like unnecessarily complex scientific charts. They
would like the system to answer the question "where is
the beef?" and then provide drill down capabilities.
- Most performance gains can be attributed to an integrated and
mutually complementary processing approach.
That is, there is no silver bullet in data mining. If there
is, then the pattern is too trivial to be exploited. Therefore,
the core data mining module must be surrounded by a complementary
set of processing algorithms, where the classification algorithm
is specifically designed for the underlying good-feature distribution
and good features are carefully selected from an available set
of preprocessing and transformation algorithms.
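The second lesson above, that much stored data is redundant, can be illustrated with a small sketch. It uses pairwise Pearson correlation as one simple proxy for redundancy; the source does not say how the value of data is actually quantified, so the method, the `redundant_fields` helper, the 0.95 threshold, and the sample table are all assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant field has no spread; treat it as uncorrelated here.
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0

def redundant_fields(table, threshold=0.95):
    """Greedily keep fields; drop any field that nearly duplicates a kept one."""
    kept, dropped = [], []
    for name in table:
        if any(abs(pearson(table[name], table[k])) >= threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

# Hypothetical table: temp_f is a linear rescaling of temp_c.
table = {
    "temp_c": [10, 15, 20, 25, 30],
    "temp_f": [50, 59, 68, 77, 86],
    "humidity": [30, 80, 45, 60, 20],
}
kept, dropped = redundant_fields(table)
print("kept:", kept)
print("dropped:", dropped)
```

Correlation only catches linear redundancy between pairs of numeric fields, so a real assessment of data value would need more than this.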
Teledyne Scientific
Company has developed advanced data mining technology to tame
the data glut or "whiteout" and turn it into usable information.
Their new technology is actually an extremely powerful, user friendly
computer program that uses sophisticated search and optimizing
algorithms to extract useful information from confusing, often
seemingly unrelated data.
Originally
designed for use by the U.S. Navy, this technology was the core
software engine created to facilitate algorithm development for
real time navigation through minefields. Now it's available for
development by business to help companies find their way through
the minefield of stored data and gain competitive advantage in
the marketplace.
What
sets this technology apart from other data mining applications
is its simplicity, ease-of-use, and flexibility. Most data mining
tools require IT staff assistance, expensive data warehouse tools
to handle legacy databases, and analysis that can often delay results.
Teledyne Scientific Company's technology combines several methodologies,
including state-of-the-art digital signal processing, pattern classification,
knowledge extraction,
reasoning under uncertainty, optimization, decision analysis, and
data visualization. Its in-depth functionality enables a wide variety
of data resources to be mined and exploited, even images such as
photographs and x-rays. This comprehensive scrutiny enables previously
unseen trends, relationships, and patterns to emerge from the data.
The results are available in real time and presented in clear,
simple English for immediate use by decision makers without exhaustive
briefings from trained statisticians.
Teledyne Scientific
Company
has assembled an impressive array of preprocessing, feature extraction,
data mining, and information visualization algorithms from decades
of basic and applied R&D contracts for various U.S. government
laboratories and industry consulting. Our data analysis experience
includes sonar, radar, relational customer and patient databases,
sleep disorder data, microarray image data, human tissues for cancer
cell detection, time series trend analysis, accounting, diagnostics,
reconnaissance imagery, surface anomalies, and weather images.
Teledyne Scientific Company has filed for seven patents in the
above six areas.
Typical Problems
Below
are typical data management and analysis problems that can be successfully
addressed by Teledyne Scientific Company's data mining technology.
Too
much data with no analysis on value of information: Sarah Brighten
is a corporate IT analyst. Over the past two years, she has learned
to do more with less, as economic conditions show no sign of
improving. Not only is she responsible for the safe storage of
mountains of data in several corporate databases and data warehouses,
she has to respond to urgent daily requests from multiple departments,
each asking for the latest batch of data. She wonders why she
has to babysit terabytes of data, much of which seems irrelevant
to increasing the bottom line of the corporation. Teledyne
Scientific Company's technology
contains easy-to-use tools for compressing and assessing the
value of information.
Software
tool is too complex and difficult to use: Kim Hughes is a marketing
director. She installed a multi-million dollar software system
for customer relationship management (CRM) over a year ago. It
took them almost a year to get the kinks out of the system and
make it work reliably over gobs of customer data. Unfortunately,
she feels that they bought a Rolls Royce when a Honda Civic with
customized options would've been sufficient. She's certain that
they probably use less than 10% of the system features. Furthermore,
because the CRM software is so complex, she had to hire a software
specialist who doesn't know marketing. As a result, Kim finds
herself spending more time on explaining to the specialist what
she wants. In addition, she finds mountains of charts and scientific
graphs generated by the software system intimidating and confusing.
She's frustrated. All she wanted was some actionable insights
in plain English from her customer data. Teledyne
Scientific Company's technology is a software tool
that can be used by domain experts with little expertise in data
mining.
No
provision for preprocessing: Samantha
Wong is a data mining specialist. She is annoyed with the data
mining software she's been using. With the recent emphasis on
trend analysis, she finds herself spending most of her time writing
custom software that will transform her unwieldy data into a
flat table that her data mining software requires. Since there
are so many ways to transform data, she's not sure if her method
is appropriate. Her boss doesn't seem to understand that she
spends all her time massaging data. He wonders why it is taking
her a long time to produce results with expensive data mining
software in her arsenal. Samantha is flabbergasted when she hears
that the IT department decided to collect more time series data,
change the data format, and increase the sampling rate from one
hour to ten minutes. All her work during the past three months
just turned into a pile of dust and she needs to write more custom
software. Teledyne
Scientific Company's technology includes application specific preprocessing
engines that are tightly integrated with the back-end easy-to-use
data mining engine.
Key Technical Concepts
Teledyne
Scientific Company developed its integrated approach to data analysis through
several years of government R&D work in military signal processing.
In
this field the main challenge is to extract and characterize extremely
low signal-to-noise ratio (SNR) events from multiple sensors with
high probability of detection and low false alarm rate. Since most
intercept systems must be capable of handling various types of
signals, the general approach is to run multiple transformation
and detection algorithms in parallel with the data fusion engine
at the back end. The second challenge is to filter and present
derived information so that critical knowledge can be absorbed
and disseminated as quickly and completely as possible.
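The pattern of parallel detectors feeding a back-end fusion stage can be sketched as follows. This is only a toy illustration: the three detectors, their thresholds, and the majority-vote fusion rule are hypothetical stand-ins, not the algorithms used in the actual system.

```python
from concurrent.futures import ThreadPoolExecutor

def peak_detector(samples, threshold=3.0):
    """Vote 1.0 if any single sample exceeds the amplitude threshold."""
    return 1.0 if max(abs(s) for s in samples) > threshold else 0.0

def energy_detector(samples, threshold=1.0):
    """Vote 1.0 if the mean power of the window exceeds the threshold."""
    power = sum(s * s for s in samples) / len(samples)
    return 1.0 if power > threshold else 0.0

def zero_crossing_detector(samples, max_rate=0.3):
    """Vote 1.0 if sign changes are rare, hinting at a structured signal."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return 1.0 if crossings / (len(samples) - 1) < max_rate else 0.0

def fuse(votes, min_votes=2):
    """Back-end fusion: declare an event when enough detectors agree."""
    return sum(votes) >= min_votes

def detect(samples, detectors):
    """Run every detector on the same window concurrently, then fuse."""
    with ThreadPoolExecutor() as pool:
        votes = list(pool.map(lambda detector: detector(samples), detectors))
    return fuse(votes)

detectors = [peak_detector, energy_detector, zero_crossing_detector]

# Hypothetical windows: low-level alternating noise versus a strong pulse.
quiet_window = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1]
event_window = [0.5, 2.0, 4.0, 3.5, 1.0, 0.2]

quiet_result = detect(quiet_window, detectors)
event_result = detect(event_window, detectors)
print("quiet window flagged:", quiet_result)
print("event window flagged:", event_result)
```

Requiring agreement between detectors is one way to trade detection probability against false alarm rate; a real fusion engine would weigh detector confidence rather than count votes.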
The
key technology discriminator is that we bring sophisticated signal/image
processing, optimization, and data mining algorithms to domain
experts without the usual technical jargon, thus demystifying data
mining using simple language of intuition and information visualization.
The fundamental design principle is judicious dimension reduction
through data adaptive, sequential processing, which is conceptually
similar to finding sufficient statistics in data analysis. That
is, what is the minimum problem dimension that characterizes the
entire data? Instead of arguing over which data mining algorithm
is superior, we focus on finding the right set of algorithms given
data and feature characteristics while paying close attention to
the point of diminishing returns. This agnostic approach to data
analysis is not only intuitively appealing, but also yields far
superior performance to a dogmatic method that relies on a rigid
set of rules or algorithm preferences. Figure 1 shows our integrated
approach to data analysis.
Figure 1: The
integrated approach to knowledge discovery that combines all
the salient concepts in signal processing, optimization, and
learning to deal effectively with time series, image, and multidimensional
data with hierarchical relationships.
For
example, contrary to popular beliefs in the image analysis community,
we discovered that image compression based on wavelet set partitioning
in hierarchical trees actually improves automatic target recognition
(ATR) performance up to a certain point mainly because the benefits
of noise suppression outweigh degradation in the fidelity of desired
targets with high eccentricity and rough texture. This finding
has significant implications in data storage requirements and cell
recognition performance. On the other hand, one dimensional discrete
cosine transform (DCT) provides the best performance in image compression
of sonar grams because the predominant attributes of narrow band
grams are line-like. The key step here is finding the required
minimum problem dimension.
Dimension
reduction implies that a small number of transform coefficients
in a different domain captures most of the energy spread in the
original raw data space. This concept has found its niche in signal
and array processing: maximize the probability that multiple signals
can be sorted in space, time, and frequency through dimension reduction
or subspace filtering. This simple, yet powerful concept is exploited
in a systematic and integrated manner to tackle any challenging
data analysis problem.
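Energy compaction is easy to demonstrate on a toy signal. The sketch below computes an orthonormal DCT directly from its definition and measures how much of the signal energy the few largest coefficients hold. The 64-sample test signal is a hypothetical example chosen to be smooth; real sensor data will compact less cleanly.

```python
import math

def dct_orthonormal(x):
    """Orthonormal DCT-II computed directly from its definition (O(N^2))."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        coeffs.append(scale * total)
    return coeffs

def energy_fraction(coeffs, keep):
    """Share of total energy held by the `keep` largest-magnitude coefficients."""
    energies = sorted((c * c for c in coeffs), reverse=True)
    return sum(energies[:keep]) / sum(energies)

# Hypothetical 64-sample signal built from two smooth cosine components.
N = 64
signal = [3.0 * math.cos(math.pi * (i + 0.5) * 2 / N)
          + 1.5 * math.cos(math.pi * (i + 0.5) * 5 / N) for i in range(N)]

coeffs = dct_orthonormal(signal)
raw_energy = sum(s * s for s in signal)
transform_energy = sum(c * c for c in coeffs)
top2 = energy_fraction(coeffs, keep=2)

print("energy preserved by the transform:", math.isclose(raw_energy, transform_energy))
print(f"share of energy in the 2 largest of {N} coefficients: {top2:.4f}")
```

Because the transform is orthonormal, total energy is the same in both domains, so keeping only the dominant coefficients gives a direct measure of how far the problem dimension can be reduced.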
In
gene chip and tissue image analysis, the same logic applies. The
key step is judicious dimension reduction as the processing stage
transitions from raw data to information to knowledge. For example,
a two-stage classify-before-detect (CBD) algorithm is capable of
detecting and characterizing low-expression spots by virtue of
energy compaction and dimension reduction. Similarly, a three-stage
image processing algorithm can handle various image-analysis problems
(rare cell detection, tissue recognition, and spot quality assessment)
using appropriate levels of abstraction and dimensionality reduction.
Let's
extend the same concept to classification. If the underlying class-conditional
good-feature distribution is unimodal Gaussian, there is very little
reason to resort to complex classification algorithms, such as
support vector machines or radial basis functions. A simple multivariate
Gaussian classifier will work equally well with substantially lower
computational requirements. The algorithm recommendation engine
is part of the Intelligent Data Mining Wizard that guides
novice users through sometimes tedious and confusing data mining
steps. Furthermore, hierarchical sequential pruning classification
can provide excellent performance even for situations with complex
distributions thanks to sequential dimension reduction.
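A minimal sketch of the kind of simple Gaussian classifier described above follows. To keep it short it assumes a diagonal covariance (features treated as independent within each class), which is a simplification of a full multivariate Gaussian classifier. The function names and the two-feature training data are hypothetical.

```python
import math

def fit_gaussian_classifier(samples_by_class):
    """Estimate per-class mean, per-feature variance, and prior probability."""
    total = sum(len(rows) for rows in samples_by_class.values())
    model = {}
    for label, rows in samples_by_class.items():
        n, dim = len(rows), len(rows[0])
        mean = [sum(r[j] for r in rows) / n for j in range(dim)]
        # A small floor keeps every variance strictly positive.
        var = [max(sum((r[j] - mean[j]) ** 2 for r in rows) / n, 1e-9)
               for j in range(dim)]
        model[label] = (mean, var, n / total)
    return model

def log_posterior(x, mean, var, prior):
    """Unnormalized log posterior under a diagonal-covariance Gaussian."""
    score = math.log(prior)
    for xj, mj, vj in zip(x, mean, var):
        score += -0.5 * math.log(2 * math.pi * vj) - (xj - mj) ** 2 / (2 * vj)
    return score

def predict(model, x):
    """Return the class label with the highest log posterior."""
    return max(model, key=lambda label: log_posterior(x, *model[label]))

# Hypothetical training data: two well-separated, unimodal classes.
training = {
    "class_a": [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (1.1, 2.1)],
    "class_b": [(3.0, 0.5), (3.2, 0.4), (2.8, 0.7), (3.1, 0.6)],
}
model = fit_gaussian_classifier(training)
label_near_a = predict(model, (1.05, 1.9))
label_near_b = predict(model, (2.9, 0.55))
print(label_near_a, label_near_b)
```

Training amounts to computing a handful of means and variances, which is why the computational cost is so low compared with iterative methods when the Gaussian assumption holds.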
In
summary, the most important ingredient in successful data analysis
is the seamless integration of various dimension reduction methodologies,
all optimized to the underlying data characteristics. Probabilistic
modeling of relationships between data and algorithms is currently
in progress so that we can gain better insight into data analysis
methodologies. This insight, coupled with more rigorous analysis
of the impacts of various transformation and compression algorithms
on the accentuation of desirable signal attributes and attenuation
of undesirable components, will be invaluable in demystifying data
mining and turning it into an essential and appreciated partner
in conquering the problem of data "whiteout."
Application Examples
The
following three short examples illustrate the major functionalities
of our approach to integrated data analysis:
- Leukemia diagnosis
- Magazine subscriber analysis
- Thrombosis diagnosis
Leukemia Diagnosis
Recent
advances in cDNA and oligonucleotide gene chip technology have been
instrumental in allowing us to take multiple snapshots of gene level
activities at an arbitrary level of abstraction in diagnostic and
prognostic applications. In order to demonstrate the utility of gene
chips in diagnostic applications, the Whitehead/MIT Center for Genome
Research prepared a data set of expression levels for 7,129 genes collected from
72 patients suffering from two types of cancer - acute myeloid leukemia
(AML) and acute lymphoblastic leukemia (ALL). Figure 2 shows the
three easy data analysis steps.
Figure 2: After
loading the metadata, the user can select outputs and inputs
using the I/O Help wizard (a). After I/O specification, the Intelligent
DM wizard takes over and recommends an appropriate set of algorithms
with their parameters filled in (b). The user has an option of
proceeding with the recommended algorithms or selecting his/her
own algorithms. After the user clicks on the Run pushbutton,
the results are summarized in a combination of text and an intuitive
rank-order curve, which shows that three out of over seven thousand
genes are required for virtually 100% accuracy in leukemia diagnosis.
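One simple way to read a rank-order curve is to find the feature count beyond which adding features no longer helps. The sketch below is an assumption about how such a cutoff could be computed, not the patented estimation method, and the accuracy values are made-up illustration numbers rather than results from the leukemia study.

```python
def point_of_diminishing_returns(scores, min_gain=0.01):
    """Smallest feature count after which every added feature gains < min_gain.

    scores[i] is the performance obtained with the top (i + 1) ranked features.
    """
    k = len(scores)
    # Walk back from the end while the k-th feature adds a negligible gain.
    while k > 1 and scores[k - 1] - scores[k - 2] < min_gain:
        k -= 1
    return k

# Hypothetical rank-order curve: accuracy versus number of top-ranked features.
accuracy_by_feature_count = [0.80, 0.93, 0.99, 0.992, 0.993, 0.993]

cutoff = point_of_diminishing_returns(accuracy_by_feature_count)
print("features worth keeping:", cutoff)
```

The `min_gain` threshold is a judgment call; a stricter value keeps more features, a looser one keeps fewer.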
Magazine Subscriber Prediction
In
this case, the goal is to identify the type of consumers likely to
subscribe to a magazine based on socioeconomic, demographic, and
other personal data. The user can type a question (How to predict
a magazine subscriber?) and the search engine can sift through the
existing database to find and recommend the most relevant metadata
sets. Again the I/O Help wizard can assist the user in specifying
inputs and outputs. Once the user confirms the I/O specification,
the DM engine performs all the necessary calculations and presents
the results in a succinct and understandable format.
In this case, likely magazine subscribers hold bankcards, have pets,
and contribute to various organizations (i.e., they are socially active).
Thrombosis Diagnosis
This
data set contains three relational database tables - general patient
information, thrombosis test results, and medical history data. Normally,
the end user would be required to transform this data into a flat
table using customized software before commencing data mining using
commercial data mining tools. However, Teledyne
Scientific Company's
preprocessing engine turns a patient's medical history data (irregularly
sampled
time series) into a set of compressed features that can be appended
to higher level relational database tables, thus creating a unified
metadata view.
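The idea of compressing a one-to-many history table into per-patient features can be sketched as follows. The choice of features (count, mean, last value, least-squares slope), the field names, and the sample records are all hypothetical; the actual preprocessing engine is not described at this level of detail.

```python
def summarize_series(points):
    """Compress irregularly sampled (day, value) pairs into a few features."""
    ordered = sorted(points)
    times = [t for t, _ in ordered]
    values = [v for _, v in ordered]
    n = len(ordered)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    # The least-squares slope uses actual timestamps, so uneven spacing is fine.
    spread = sum((t - mean_t) ** 2 for t in times)
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in ordered) / spread
             if spread else 0.0)
    return {"n_tests": n, "mean": mean_v, "last": values[-1], "slope": slope}

def build_flat_view(patients, history):
    """Append per-patient history features to the patient table."""
    by_patient = {}
    for row in history:
        by_patient.setdefault(row["patient_id"], []).append((row["day"], row["value"]))
    flat = []
    for patient in patients:
        points = by_patient.get(patient["patient_id"])
        features = summarize_series(points) if points else {"n_tests": 0}
        flat.append({**patient, **features})
    return flat

# Hypothetical tables: one row per patient, many history rows per patient.
patients = [{"patient_id": 1, "age": 54}, {"patient_id": 2, "age": 37}]
history = [
    {"patient_id": 1, "day": 0, "value": 10.0},
    {"patient_id": 1, "day": 3, "value": 13.0},
    {"patient_id": 1, "day": 10, "value": 20.0},
    {"patient_id": 2, "day": 1, "value": 5.0},
]

flat = build_flat_view(patients, history)
for row in flat:
    print(row)
```

Each patient ends up as a single row regardless of how many tests were recorded or how they were spaced, which is the flat form most data mining tools expect.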
The
results are summarized in terms of the best classification algorithm,
the actual performance in terms of Type I/II errors, and the best
feature subset to be used in thrombosis diagnosis.
Licensing/Services
Teledyne
Scientific Company offers licenses as well as software development and consulting services
to address each client's unique data analysis needs.
Licensing: Teledyne
Scientific Company desires to license this technology to a company that would like
to productize this software and take it to market. It is ideal
for data mining across a broad range of disciplines, including
pattern recognition and signal processing in scientific research;
competition and productivity analysis, risk management, and market
studies for business; risk analysis, actuarial research, and market
trends for the demand forecasting and insurance industries, to
name just a few.
Software
development: Teledyne
Scientific Company has developed several unique toolboxes for integrated data analysis
and high level system design for various U.S. government laboratories
and commercial research establishments. These are highly interactive
and versatile tools that can be customized to adequately address
each customer's specific needs. We will leverage the existing
tools in preprocessing, data mining, optimization, and visualization
to deliver a cost effective and highly optimized solution to
each client.
Consulting: Teledyne
Scientific Company will provide consulting services to work on any challenging data
analysis on an outsourcing basis.
Patents Filed
| Title | Description |
| Automatic mapping of data characteristics to image and signal processing algorithms for feature extraction | As bandwidth becomes more plentiful, data mining must be able to handle spatially and temporally sampled data, such as image and time series data, respectively. This invention describes a method to find appropriate digital signal processing (DSP) and image processing (IP) algorithms based on data characteristics. DSP and IP algorithms transform raw time series and image data into projection spaces, where good features can be extracted for data mining. |
| Automatic data exploration to seek meaningful relationships among original and derived fields by stealing CPU cycles | This invention presents a method or apparatus for automatic data exploration with no human intervention when computer resources are underutilized so that actual data mining tasks can be performed with ease and speed. |
| Estimation of the point of diminishing returns in data mining | This invention presents a method to quantify the extent to which a data mining algorithm captures useful information embedded in input data. The key concept is forward-reverse mapping between feature space and classification space, where we perform confusion analysis. That is, we quantify the consistency in the levels of confusion in the two spaces. |
| Hierarchical characterization of fields from multiple tables with one-to-many relations for comprehensive data mining | This invention presents a method to summarize or characterize information scattered over multiple tables that are related through one-to-many relationships. The end result is a metadata table, which is a collection of multiple relational tables. |
| Intelligent performance optimizer that recommends a set of classifiers and parameters based on good-feature distribution and user preferences | This invention describes a knowledge-based, automated performance optimizer that characterizes good-feature probability distribution with a vector of features and assigns appropriate decision algorithms by mapping the feature vector and user's preferences onto a decision-algorithm surface. |
| Text display of key data mining performance results | The invention conveys key performance results of a data mining operation in plain English so that a novice user can understand them without having to consult an expert for interpretation. |
| One-step data mining with a "where am I" interrupt button | This invention describes a method or apparatus that permits one-step data mining for novice users, thereby avoiding all the headaches and confusions associated with the interactive nature of data mining, namely specification of numerous parameters associated with various steps in data mining. |