Data Confidentiality Workshop

WORKSHOP ON DATA CONFIDENTIALITY

September 6-7, 2007 in Arlington, VA

White Paper & Bio


DATA MINING, SECURITY AND PRIVACY

Bhavani Thuraisingham
The University of Texas at Dallas


What is Data Mining?
Data mining is the process of posing queries and extracting useful information, patterns, and trends, often previously unknown, from large quantities of data, possibly stored in databases. For many organizations, the goals of data mining include improving marketing capabilities, detecting abnormal patterns, and predicting the future based on past experiences and current trends. There is clearly a need for this technology. Large amounts of current and historical data are being stored, and as databases become larger it becomes increasingly difficult to use them directly to support decision making. In addition, the data could come from multiple sources and multiple domains. There is a clear need to analyze the data to support planning and related functions of an enterprise.
Various terms have been used to refer to data mining. These include knowledge/data/information discovery and knowledge/data/information extraction. Some define data mining to be the process of extracting previously unknown information while knowledge discovery is defined as the process of making sense out of the extracted information.
Data Mining Technologies, Techniques and Applications
Data mining techniques include those based on statistical reasoning techniques, inductive logic programming, machine learning, fuzzy sets, and neural networks, among others. The data mining outcomes include classification, finding rules to partition data into groups; association, finding rules to make associations between data; and sequencing, finding rules to order data. Essentially one arrives at some hypothesis, which is the information extracted, from examples and patterns observed. These patterns are observed by posing a series of queries where a query may depend on the responses obtained to the previous queries posed.
Data mining integrates multiple technologies. These include data management such as database management and data warehousing, statistics, machine learning, decision support, and others such as visualization and parallel computing. There is a series of steps involved in data mining. These include getting the data organized for mining, determining the desired outcomes from mining, selecting tools for mining, carrying out the mining, pruning the results so that only the useful ones are considered further, taking actions based on the data mining results, and evaluating the results of the actions to determine benefits.
While numerous developments have been made in data mining, such as improved accuracy, there are still many challenges. For example, given the large volumes of data, how can the algorithms determine which technique to select and what type of data mining to perform? More importantly, how can one reduce the false positives and false negatives that are present in the data mining results? Often the data may be incomplete and/or inaccurate due to data entry errors. At times there may be redundant information, and at times there may not be sufficient information. Therefore improving the data quality is critical if the results of data mining are to be useful. Many of the developments in data mining have been on mining relational and structured databases. The current trends include mining web data, mining distributed and heterogeneous databases, and mining multimedia data.
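As a rough illustration of what "false positives" and "false negatives" mean in practice, the sketch below counts them for a hypothetical detector and derives the corresponding error rates. The function names and the sample labels are assumptions made for this example; they are not taken from any particular tool.

```python
def confusion_counts(actual, predicted):
    """Count (true pos, false pos, true neg, false neg) for two boolean lists."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1          # flagged, and really positive
        elif p and not a:
            fp += 1          # flagged, but actually negative (false positive)
        elif a:
            fn += 1          # missed a real positive (false negative)
        else:
            tn += 1          # correctly left alone
    return tp, fp, tn, fn


def error_rates(actual, predicted):
    """Return (false positive rate, false negative rate)."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Hypothetical ground truth and detector output for eight records.
actual = [True, True, False, False, False, True, False, False]
predicted = [True, False, True, False, False, True, False, False]

counts = confusion_counts(actual, predicted)   # (2, 1, 4, 1)
fpr, fnr = error_rates(actual, predicted)      # fpr is 1/5, fnr is 1/3
```

Tracking both rates separately matters because lowering one (for example, by flagging fewer records) typically raises the other.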
Data mining has many applications, including in medicine, finance, marketing and sales, as well as in security. For example, neural networks may be trained to detect a particular disease based on the symptoms. Prediction techniques may be used for market forecasting. Association rule mining techniques may be used to make connections between products so that a corporation can market certain products together and improve sales. Link analysis techniques may be used to find correlations between suspicious people. Anomaly detection techniques may be used to detect unauthorized intrusions into a computer system or a network. However, using data mining for security applications increases the concerns for privacy, as even naïve users can now use these data mining tools and extract highly sensitive and private information.
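The product-association idea mentioned above can be sketched with the two standard measures used in association rule mining, support and confidence. This is a minimal sketch on a made-up set of shopping baskets; the basket contents and function names are assumptions for illustration only.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    n_antecedent = count_containing(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    n_both = count_containing(transactions, set(antecedent) | set(consequent))
    return n_both / n_antecedent


# Hypothetical market baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]

s = support(baskets, {"bread", "butter"})          # 3 of 5 baskets -> 0.6
c = confidence(baskets, {"bread"}, {"butter"})     # 3 of 4 bread baskets -> 0.75
```

A rule such as "bread implies butter" would usually be kept only if both its support and its confidence exceed thresholds chosen by the analyst.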
Data Mining for Security Applications
The threats to homeland security include attacks on buildings and the destruction of critical infrastructures such as power grids and telecommunication systems. Data mining techniques are being investigated to find out who the suspicious people are and who is capable of carrying out terrorist activities.
Consider association rule mining techniques. The goal here is to find items that go together. For example, consider the scenario where John comes from Country X and he has an association with James, who has a criminal record. Furthermore, an unusually large percentage of people from Country X have carried out terrorist attacks. Because of the associations between John and Country X, as well as between John and James, and James and criminal records, one may conclude that John has to be under observation. While association rule mining techniques are essentially intelligent search techniques, link analysis uses graph theoretic methods for detecting patterns by following the nodes and links. For example, consider the chain “A is seen with B, B is a friend of C, C and D travel together, and D has a criminal record.” The question is: what conclusions can one draw about A?
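To make "following the nodes and links" concrete, here is a small sketch that stores the chain from the example as a labeled graph and uses breadth-first search to recover the shortest chain of links between two nodes. The link labels and the function name are illustrative assumptions. The code only surfaces the chain; what, if anything, can be concluded about A remains the open question raised above.

```python
from collections import deque


def shortest_chain(links, start, goal):
    """Return the shortest list of (node, label, node) hops from start to goal, or None."""
    # Build an undirected adjacency list from (node, node, label) triples.
    graph = {}
    for a, b, label in links:
        graph.setdefault(a, []).append((b, label))
        graph.setdefault(b, []).append((a, label))

    # Breadth-first search, remembering how each node was first reached.
    came_from = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbor, label in graph.get(node, []):
            if neighbor not in came_from:
                came_from[neighbor] = (node, label)
                queue.append(neighbor)

    if goal not in came_from:
        return None

    # Walk back from goal to start to rebuild the chain in order.
    chain = []
    node = goal
    while came_from[node] is not None:
        prev, label = came_from[node]
        chain.append((prev, label, node))
        node = prev
    chain.reverse()
    return chain


# Hypothetical links mirroring the chain in the text.
links = [
    ("A", "B", "seen with"),
    ("B", "C", "friend of"),
    ("C", "D", "travels with"),
]

chain = shortest_chain(links, "A", "D")
# [("A", "seen with", "B"), ("B", "friend of", "C"), ("C", "travels with", "D")]
```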
Next consider clustering techniques. For example, people with origins in country X belonging to a certain religion may be grouped into Cluster A. People with origins in country Y who are less than 50 years old may form another Cluster B. These clusters are formed based on their travel patterns, eating patterns, spending patterns and behavior patterns. Clustering divides the population without any pre-specified condition, whereas classification divides the population based on some pre-defined condition. The condition is found based on examples. For example, one can form a profile of a terrorist with the following characteristics: male, less than 30 years of age, belonging to a certain religious group, and of a certain ethnic origin. This means that all males less than 30 years of age belonging to the same religion and the same ethnic origin will be classified into this group and could possibly be placed under observation.
Another data mining outcome is anomaly detection. A good example here is learning to fly an airplane without wanting to learn to take off or land. People in general want to get a complete training course in flying. However, there may be some individuals who want to learn flying but do not care about takeoff or landing. This is an anomaly.
Data mining is also being applied to cyber security problems. These include problems such as intrusion detection and auditing. For example, anomaly detection techniques could be used to detect unusual patterns and behaviors. Link analysis may be used to trace viruses to the perpetrators. Classification may be used to group various cyber attacks and then use the profiles to detect an attack when it occurs. Prediction may be used to determine potential future attacks depending on information learnt about terrorists through email and phone conversations. Data mining can also be used for analyzing web logs and audit trails. Based on the results of the data mining tool, one can then determine whether any unauthorized intrusions have occurred and/or whether any unauthorized queries have been posed.
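One simple way to detect "unusual patterns" in an audit trail is to flag values that lie far from the mean, measured in standard deviations (a z-score test). The sketch below is only one of many possible anomaly detection techniques; the hourly failed-login counts, the threshold, and the function name are assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)   # population standard deviation
    if sigma == 0:
        return []            # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical failed-login counts per hour taken from an audit log.
failed_logins = [3, 4, 2, 5, 3, 4, 3, 60, 4, 2]

anomalies = find_anomalies(failed_logins)   # [(7, 60)]
```

A single extreme value inflates the standard deviation itself, so in practice a more robust baseline (for example, one computed from earlier, known-normal data) may be preferable.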
Privacy Considerations
With the World Wide Web, there is now an abundance of data that one could obtain about individuals and mine within seconds to extract highly sensitive or private information. This could result in serious consequences, such as an insurance company denying insurance or a loan agency denying loans based on private information that is gathered. Therefore an emerging goal of data mining is to prevent users from mining and extracting sensitive and private information from the data. This has resulted in a new area called privacy preserving data mining. Two major approaches are being investigated for conducting data mining while ensuring privacy. In one approach, called the perturbation approach, the data is perturbed or randomized and mining is carried out on the modified values. The goal is to ensure that the results of the mining do not deviate significantly from the results obtained from mining the original data sets. In the second approach, called the multiparty approach, the idea is that each party does not know any data except its own data and the results of mining. This is accomplished by using cryptographic protocols.
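The two approaches can be illustrated with toy examples. For perturbation, the sketch adds zero-mean random noise to each value and checks that an aggregate (the mean) is roughly preserved. For the multiparty approach, it simulates a secure sum based on additive secret sharing, in which each party splits its private value into random shares so that no single share reveals the value. Both are simplified sketches under stated assumptions, not the specific protocols used in the literature: the noise distribution, the modulus, the sample data, and all function names are chosen for illustration, and the parties are simulated within one process.

```python
import random
import secrets
from statistics import mean


def perturb(values, spread, rng):
    """Add independent uniform noise in [-spread, spread] to every value."""
    return [v + rng.uniform(-spread, spread) for v in values]


# Assumed modulus; private values must be non-negative with a total below it.
MODULUS = 2**61 - 1


def make_shares(value, n_parties):
    """Split value into n_parties random shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values):
    """Simulate a secure sum: party i sends its j-th share to party j."""
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    # Each party j adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Combining the published partial sums yields the total, not the individual inputs.
    return sum(partial_sums) % MODULUS


# Perturbation: hypothetical ages, seeded so that the run is repeatable.
rng = random.Random(42)
ages = [rng.randint(18, 90) for _ in range(10000)]
noisy = perturb(ages, 10.0, rng)
mean_shift = abs(mean(noisy) - mean(ages))   # small relative to the noise spread

# Multiparty: three parties with private counts 12, 30 and 7.
total = secure_sum([12, 30, 7])              # 49
```

In the perturbation case individual records are distorted while the aggregate stays close to the original; in the multiparty case the result is exact, at the cost of communication between parties.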
The current debate among counter-terrorism experts, policy makers, civil liberties unions and human rights lawyers is about how much privacy one should give up in order to carry out data mining and surveillance for homeland security. Counter-terrorism experts ask what the alternatives are if a government is to combat terrorism effectively. Should it wait until privacy violations occur and then prosecute, or should it wait until national security disasters occur and then gather information? That is, how can one have privacy but at the same time ensure security? It is critical that data mining technologists, lawyers, policy makers and privacy advocates work together to arrive at an acceptable solution.

Conclusion
Without a doubt, data mining is a necessary technology for applications in all walks of life. However, due to the false positives and false negatives present in the data mining results, it is important that humans remain in the loop and use the data mining results for guidance. Furthermore, data mining can cause privacy violations. The challenge is to conduct data mining to obtain useful results but at the same time ensure the privacy of individuals.

 

Dr. Bhavani Thuraisingham

University of Texas at Dallas

 

 

Biographical Data

Dr. Bhavani Thuraisingham joined The University of Texas at Dallas (UTD) in October 2004 as a Professor of Computer Science and Director of the Cyber Security Research Center in the Erik Jonsson School of Engineering and Computer Science. She is an elected Fellow of three professional organizations: the IEEE (Institute of Electrical and Electronics Engineers), the AAAS (American Association for the Advancement of Science) and the BCS (British Computer Society) for her work in data security. She received the IEEE Computer Society’s prestigious 1997 Technical Achievement Award for “outstanding and innovative contributions to secure data management.” Her research interests are in assured information sharing and trustworthy semantic web; secure geospatial data management; and data mining for security applications. Prior to joining UTD, Thuraisingham worked for the MITRE Corporation for 16 years, which included an IPA (Intergovernmental Personnel Act) assignment at the National Science Foundation. Her work in information security and information management has resulted in over 80 journal articles, over 200 refereed conference papers, and three US patents. She is the author of eight books in data management, data mining and data security.
Prof. Thuraisingham’s website is http://www.utdallas.edu/~bxt043000/