DATA MINING, SECURITY AND PRIVACY
Bhavani Thuraisingham
The University of Texas at Dallas
What is Data Mining?
Data mining is the process of posing queries and extracting useful
information, patterns, and trends, often previously unknown, from large
quantities of data, possibly stored in databases. Essentially, for
many organizations, the goals of data mining include improving marketing
capabilities, detecting abnormal patterns, and predicting the future
based on past experiences and current trends. There is clearly a need
for this technology. There are large amounts of current and historical
data being stored. Therefore, as databases become larger, it becomes
increasingly difficult to support decision making. In addition, the
data could be from multiple sources and multiple domains. There is
a clear need to analyze the data to support planning and related functions
of an enterprise.
Various terms have been used to refer to data mining. These include
knowledge/data/information discovery and knowledge/data/information
extraction. Some define data mining to be the process of extracting
previously unknown information while knowledge discovery is defined
as the process of making sense out of the extracted information.
Data Mining Technologies, Techniques and Applications
Data mining techniques include those based on statistical reasoning
techniques, inductive logic programming, machine learning, fuzzy sets,
and neural networks, among others. The data mining outcomes include
classification, finding rules to partition data into groups; association,
finding rules to make associations between data; and sequencing, finding
rules to order data. Essentially one arrives at some hypothesis, which
is the information extracted, from examples and patterns observed.
These patterns are observed by posing a series of queries where a
query may depend on the responses obtained to the previous queries
posed.
Data mining integrates multiple technologies. These include data management
such as database management and data warehousing, statistics, machine
learning, decision support, and others such as visualization and parallel
computing. There is a series of steps involved in data mining. These
include getting the data organized for mining, determining the desired
outcomes from mining, selecting tools for mining, carrying out the
mining, pruning the results so that only the useful ones are considered
further, taking actions based on the data mining results, and evaluating
the results of the actions to determine benefits.
While numerous developments have been made in data mining, such as
improved accuracy, there are still many challenges. For example, given
the large volumes of data, how can the algorithms determine which
technique to select and what type of data mining to perform? More
importantly, how can one reduce the false positives and false negatives
that are present in the data mining results? Often the data may be
incomplete and/or inaccurate due to data entry errors. At times there
may be redundant information, and at times there may not be sufficient
information. Therefore, improving the data quality is critical if
the results of data mining are to be useful. Many of the developments
in data mining have been on mining relational and structured databases.
The current trends include mining web data, mining distributed and
heterogeneous databases, and mining multimedia data.
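The false positives and false negatives mentioned above can be
quantified once a tool's output is compared against known outcomes.
The following is a minimal sketch, not taken from the original text;
the function name error_rates and the sample labels are hypothetical.

```python
def error_rates(predicted, actual):
    """Return (false_positive_rate, false_negative_rate) for boolean labels."""
    pairs = list(zip(predicted, actual))
    # False positive: flagged by the tool but not truly abnormal.
    fp = sum(1 for p, a in pairs if p and not a)
    # False negative: truly abnormal but missed by the tool.
    fn = sum(1 for p, a in pairs if not p and a)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Hypothetical labels: True means "flagged" (predicted) or "abnormal" (actual).
predicted = [True, True, False, False, True, False, False, False]
actual = [True, False, False, False, True, True, False, False]
fpr, fnr = error_rates(predicted, actual)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

In this made-up sample, one of the five normal records is wrongly
flagged and one of the three abnormal records is missed.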
Data mining has many applications, including in the medical, financial,
and marketing and sales domains, as well as in security. For example,
neural networks
may be trained to detect a particular disease based on the symptoms.
Prediction techniques may be used for market forecasting. Association
rule mining techniques may be used to make connections between products
so that a corporation can market certain products together and improve
sales. Link analysis techniques may be used to find correlations
between suspicious people. Anomaly detection techniques may be used
to detect unauthorized intrusions into a computer system or a network.
However, using data mining for security applications increases the
concerns for privacy, as even naïve users can now use these data
mining tools to extract highly sensitive and private information.
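To make the product example concrete: association rule mining is
commonly described in terms of the support and confidence of a rule.
The sketch below assumes those standard measures; the basket data and
function names are hypothetical and not from the original text.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical sales baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
    {"bread", "butter", "milk"},
]

# Rule "bread -> butter": 3 of 5 baskets contain both items,
# and 3 of the 4 baskets with bread also contain butter.
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule with high support and confidence would suggest marketing the
two products together; real tools search over many candidate rules
rather than checking a single one.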
Data Mining for Security Applications
The threats to homeland security include attacks on buildings and the
destruction of critical infrastructures such as power grids and
telecommunication
systems. Data mining techniques are being investigated to find out
who the suspicious people are and who is capable of carrying out terrorist
activities.
Consider association rule mining techniques. The goal here is to find
items that go together. For example, consider the scenario where John
comes from Country X and he has an association with James who has
a criminal record. Furthermore, an unusually large percentage of people
from Country X have carried out terrorist attacks. Because of the
associations between John and Country X, as well as between John and
James, and James and criminal records, one may conclude that John
has to be placed under observation. While association rule mining
techniques
are essentially intelligent search techniques, link analysis uses
graph-theoretic methods for detecting patterns by following the nodes
and links. For example, consider the chain “A is seen with B
and B is a friend of C and C and D travel together and D has a
criminal record.”
The question is: what conclusions can one draw about A?
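Following nodes and links can be illustrated with a breadth-first
search over a small graph. This is a minimal sketch: the function name
find_chain is hypothetical, and the link from C to D assumes the chain
connects them through travel. A chain found this way only shows that
a connection exists; it is not by itself a conclusion about A.

```python
from collections import deque


def find_chain(links, start, goal):
    """Breadth-first search returning the shortest chain of nodes, or None."""
    graph = {}
    for a, b, _relation in links:
        # Treat each observed relationship as an undirected link.
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical links mirroring the chain in the text.
links = [
    ("A", "B", "seen with"),
    ("B", "C", "friend of"),
    ("C", "D", "travels with"),
]
chain = find_chain(links, "A", "D")
print(chain)
```

The number of hops in the chain is one rough indicator of how indirect
the association is, which is why such results call for human review.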
Next consider clustering techniques. For example, people with origins
in country X belonging to a certain religion may be grouped into Cluster
A. People with origins in country Y who are less than 50 years old
may form another Cluster B. These clusters are formed based on their
travel patterns, eating patterns, spending patterns and behavior patterns.
Whereas clustering divides the population without any pre-specified
condition, classification divides the population based on some predefined
condition. The condition is found based on examples. For example,
one can form a profile of a terrorist with the following characteristics:
male less than 30 years of age and belonging to a certain religious
group and of a certain ethnic origin. This means that all males less
than 30 years belonging to the same religion and the same ethnic origin
will be classified into this group and could possibly be placed under
observation.
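The difference between the two outcomes can be sketched on a single
neutral feature such as a spending figure. The code below is
illustrative only: two_means is a basic two-cluster k-means loop that
uses no labels, while learn_threshold derives its condition from
labeled examples. All names and numbers are hypothetical.

```python
def two_means(values, iterations=20):
    """Cluster 1-D values into two groups; no labels are used."""
    centers = [min(values), max(values)]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Assign each value to the nearer of the two centers.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


def learn_threshold(examples):
    """Learn a cut-off from (value, label) examples: midpoint of the class means."""
    low = [v for v, label in examples if label == "low"]
    high = [v for v, label in examples if label == "high"]
    return (sum(low) / len(low) + sum(high) / len(high)) / 2


def classify(value, threshold):
    """Apply the learned condition to a new value."""
    return "high" if value > threshold else "low"


# Clustering: groups emerge from the data alone.
spending = [12, 15, 14, 90, 95, 88]
centers, groups = two_means(spending)

# Classification: the condition is found from labeled examples.
examples = [(10, "low"), (20, "low"), (80, "high"), (100, "high")]
threshold = learn_threshold(examples)
label = classify(60, threshold)
print(groups, threshold, label)
```

With real data there would be many features and far less clean
separation, which is one source of the false positives discussed
earlier.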
Another data mining outcome is anomaly detection. A good example here
is learning to fly an airplane without wanting to learn to take off
or land. People in general want a complete training course
in flying. However, there may be some individuals who want to learn
to fly but do not care about takeoff or landing. This is an anomaly.
Data mining is also being applied to cyber security problems. These
include problems such as intrusion detection and auditing. For example,
anomaly detection techniques could be used to detect unusual patterns
and behaviors. Link analysis may be used to trace the viruses to the
perpetrators. Classification may be used to group various cyber attacks
and then use the profiles to detect an attack when it occurs. Prediction
may be used to determine potential future attacks depending on information
learnt about terrorists through email and phone conversations. Data
mining can also be used for analyzing web logs and audit trails. Based
on the results of the data mining tool, one can then determine whether
any unauthorized intrusions have occurred and/or whether any unauthorized
queries have been posed.
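As one illustration of anomaly detection on audit data, the sketch
below flags accounts whose activity count lies far from a historical
baseline, using a simple z-score. The technique, the cut-off of 3
standard deviations, and the sample counts are assumptions chosen for
illustration, not a method prescribed by the text.

```python
from statistics import mean, stdev


def flag_anomalies(baseline, observed, z_cutoff=3.0):
    """Return keys in `observed` whose count is far from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    flagged = []
    for key, count in observed.items():
        # z-score: distance from the baseline mean in standard deviations.
        z = (count - mu) / sigma if sigma else 0.0
        if abs(z) > z_cutoff:
            flagged.append(key)
    return flagged


# Hypothetical audit data: failed logins per day in the past, then today's counts.
baseline_failed_logins = [4, 5, 6, 5, 4, 6, 5, 5]
today = {"alice": 5, "bob": 6, "mallory": 42}
suspects = flag_anomalies(baseline_failed_logins, today)
print(suspects)
```

A flagged account is only a candidate for review; the cut-off trades
false positives against false negatives.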
Privacy Considerations
With the World Wide Web, there is now an abundance of data that
one could obtain about individuals, and one could mine the data to
extract highly
sensitive or private information within seconds. This could result
in serious consequences such as an insurance company denying insurance
or a loan agency denying loans based on private information that is
gathered. Therefore an emerging goal of data mining is to prevent
users from mining and extracting sensitive and private information
from the data. This has resulted in a new area called privacy preserving
data mining. Two major approaches are being investigated for conducting
data mining while ensuring privacy. In one approach, called the
perturbation approach, the data is perturbed or randomized and mining
is carried out on the modified values. The goal is to ensure that
the results of the mining do not deviate significantly from the
results obtained
from mining the original data sets. In the second approach, called
the multiparty approach, the idea is that each party does not know
any data except its own data and the results of mining. This is accomplished
by using cryptographic protocols.
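Both approaches can be sketched in a few lines. For perturbation, the
example adds zero-mean noise to each value and checks that an
aggregate (the mean) is roughly preserved. For the multiparty
approach, it uses additive secret sharing to compute a sum, one simple
building block of such protocols. The function names, the data, and
the choice of these two particular techniques are assumptions made for
illustration; a real protocol would use a cryptographically secure
random source rather than Python's random module.

```python
import random


def perturb(values, scale, rng):
    """Add independent zero-mean uniform noise in [-scale, scale] to each value."""
    return [v + rng.uniform(-scale, scale) for v in values]


def make_shares(secret, parties, modulus, rng):
    """Split `secret` into additive shares that sum to it modulo `modulus`."""
    shares = [rng.randrange(modulus) for _ in range(parties - 1)]
    shares.append((secret - sum(shares)) % modulus)
    return shares


def secure_sum(private_values, modulus, rng):
    """Sum the parties' values while each party sees only random-looking shares."""
    n = len(private_values)
    all_shares = [make_shares(v, n, modulus, rng) for v in private_values]
    # Party j adds up the j-th share it received from every party.
    partial = [sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)]
    # Only the partial sums are published; together they reveal just the total.
    return sum(partial) % modulus


rng = random.Random(0)  # seeded for repeatability; NOT suitable for real secrecy

# Perturbation: individual values change, but the mean stays close.
ages = [20 + (i % 50) for i in range(10000)]
noisy_ages = perturb(ages, 10, rng)
true_mean = sum(ages) / len(ages)
noisy_mean = sum(noisy_ages) / len(noisy_ages)

# Multiparty: three parties learn the total without revealing their own counts.
total = secure_sum([120, 45, 300], 2**31, rng)
print(true_mean, round(noisy_mean, 2), total)
```

The perturbation sketch preserves only simple aggregates; how well
more complex mining results survive depends on the noise and the
algorithm used.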
The current debate among counter-terrorism experts, policy makers,
civil liberties unions and human rights lawyers is about how much
privacy one should give up in order to carry out data mining and surveillance
for homeland security. Counter-terrorism experts ask what the alternatives
are if a government is to combat terrorism effectively. Should it
wait until privacy violations occur and then prosecute, or should
it wait until national security disasters occur and then gather information?
That is, how can one have privacy but at the same time ensure security?
It is critical that data mining technologists, lawyers, policy makers
and privacy advocates work together to arrive at an acceptable solution.
Conclusion
Without a doubt, data mining is a necessary technology for applications
in all walks of life. However, due to the false positives and false
negatives present in the data mining results, it is important that
humans are in the loop and use the data mining results for guidance.
Furthermore, data mining can cause privacy violations. The challenge
is to conduct data mining to obtain useful results but at the same
time ensure the privacy of individuals.