Anonymity for Classification
Releasing person-specific data in its most specific state poses a threat to individual privacy. Consider a table of patient information with the attributes Diagnosis, Zip code, Birthdate, and Sex. If a description on (Zip code, Birthdate, Sex) is so specific that few people match it, releasing the table links a unique individual, or a small number of individuals, with the sensitive information on Diagnosis.
The problem of anonymity for classification is about transforming a given data set to satisfy two goals. The privacy goal is specified by the anonymity on a combination of attributes called a virtual identifier: each description on a virtual identifier must be shared by at least some minimum number of records in the table. A generalization taxonomy tree is specified for each categorical attribute in a virtual identifier. The classification goal is to keep the released data useful for classification. These two goals deal with two different types of information: the privacy goal requires masking sensitive information, usually specific descriptions that identify individuals, whereas the classification goal requires extracting general structures that capture trends and patterns. If generalization is performed "carefully", identifying information can be masked while the trends and patterns needed for classification are preserved.
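The anonymity requirement above can be checked with a few lines of Python. This is a minimal sketch, not part of TDS itself: the function name `satisfies_anonymity`, the threshold parameter `k`, and the toy patient records are all assumptions made for illustration.

```python
from collections import Counter

def satisfies_anonymity(records, virtual_identifier, k):
    """Return True if every description on the virtual identifier
    is shared by at least k records (k is a hypothetical threshold name)."""
    counts = Counter(
        tuple(record[attr] for attr in virtual_identifier) for record in records
    )
    return all(count >= k for count in counts.values())

# Hypothetical toy table; the values are invented for illustration only.
patients = [
    {"Zip": "53711", "Birthdate": "1965", "Sex": "F", "Diagnosis": "Flu"},
    {"Zip": "53711", "Birthdate": "1965", "Sex": "F", "Diagnosis": "Asthma"},
    {"Zip": "53712", "Birthdate": "1970", "Sex": "M", "Diagnosis": "Flu"},
]
vid = ("Zip", "Birthdate", "Sex")

# The third record is the only one matching its description,
# so a threshold of 2 is not met.
print(satisfies_anonymity(patients, vid, 2))  # False
```

Generalizing the three virtual identifier attributes to coarser values (for example, a shared Zip prefix) merges descriptions and can make the same check pass.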
What is TDS?
Top-Down Specialization (TDS) is a practical and efficient program that generalizes a given table to a state that masks sensitive information and remains useful for classification modeling. The generalization is implemented by specializing, that is, adding detail to, the data in a top-down manner until any further specialization would violate the minimum privacy requirement. This top-down specialization is natural and efficient for handling both categorical and continuous attributes.
The outputs of TDS include a generalized version of the given table and some dynamically generated taxonomy trees. The output format is compatible with the C4.5 classifier, so users can build a quality classifier without further data transformation.
The project is developed at Simon Fraser University. This software is free for academic and research purposes. If it results in further publication, we would appreciate a citation to the following publication:
B. C. M. Fung, K. Wang, and P. S. Yu. "Top-Down Specialization for Information and Privacy Preservation", In Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, April 5-8, 2005.
[Full paper: pdf, ps] [Slides: pdf, ppt]
How does TDS Work?
TDS generalizes a table to satisfy the anonymity requirement while preserving its usefulness for classification. It generalizes the table by specializing it iteratively, starting from the most general state. At each step, a general (i.e. parent) value is specialized into a specific (i.e. child) value for a categorical attribute, or an interval is split into two sub-intervals for a continuous attribute. Each specialization is chosen to maximize the information gain and minimize the anonymity loss. This process is repeated until further specialization would lead to a violation of the anonymity requirement.
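The loop described above can be sketched for a single categorical attribute as follows. This is an illustrative sketch under stated assumptions, not the TDS implementation: the taxonomy, the records, the threshold name `k`, and all function names are hypothetical, and the score used to rank candidates (information gain divided by anonymity loss plus one) is just one simple way to combine the two criteria.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def leaves(taxonomy, node):
    """Set of raw (leaf) values covered by a taxonomy node."""
    children = taxonomy.get(node, [])
    if not children:
        return {node}
    return set().union(*(leaves(taxonomy, child) for child in children))

def generalize(records, taxonomy, cut):
    """Replace each raw value by the node in the current cut that covers it."""
    cover = {leaf: node for node in cut for leaf in leaves(taxonomy, node)}
    return [(cover[value], label) for value, label in records]

def min_group_size(table):
    """Size of the smallest group of records sharing a generalized value."""
    return min(Counter(value for value, _ in table).values())

def top_down_specialize(records, taxonomy, root, k):
    """Greedily specialize from the root while every group keeps >= k records."""
    cut = {root}
    while True:
        current = generalize(records, taxonomy, cut)
        best_score, best_cut = None, None
        for node in sorted(cut):  # sorted only to make tie-breaking deterministic
            children = taxonomy.get(node, [])
            if not children:
                continue  # leaf values cannot be specialized further
            new_cut = (cut - {node}) | set(children)
            candidate = generalize(records, taxonomy, new_cut)
            if min_group_size(candidate) < k:
                continue  # this specialization would violate anonymity
            parent = [label for value, label in current if value == node]
            groups = {}
            for value, label in candidate:
                if value in children:
                    groups.setdefault(value, []).append(label)
            gain = entropy(parent) - sum(
                len(g) / len(parent) * entropy(g) for g in groups.values()
            )
            loss = min_group_size(current) - min_group_size(candidate)
            score = gain / (loss + 1)  # assumed scoring rule, for illustration
            if best_score is None or score > best_score:
                best_score, best_cut = score, new_cut
        if best_cut is None:
            return cut  # no valid specialization remains
        cut = best_cut

# Hypothetical taxonomy and (job, class label) records.
taxonomy = {
    "Any": ["Blue-collar", "White-collar"],
    "Blue-collar": ["Welder", "Painter"],
    "White-collar": ["Engineer", "Lawyer"],
}
jobs = [("Welder", "Y"), ("Welder", "Y"), ("Painter", "N"), ("Painter", "N"),
        ("Engineer", "Y"), ("Engineer", "N"), ("Engineer", "Y"), ("Lawyer", "N")]

print(sorted(top_down_specialize(jobs, taxonomy, "Any", 2)))
# ['Painter', 'Welder', 'White-collar']
```

In this toy run, "White-collar" stays generalized because splitting it would leave the single "Lawyer" record in a group smaller than the threshold, while "Blue-collar" is specialized because both of its child groups remain large enough.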
The TDS algorithm was experimentally evaluated and compared to state-of-the-art privacy protection methods. TDS uses a heuristic to determine each specialization, drastically reducing the search space of generalization. Experiments show that TDS generalizes a given table to satisfy a broad range of anonymity requirements without significantly sacrificing its usefulness for classification. Furthermore, TDS is efficient and scalable: for example, it can generalize 45 thousand records within 10 seconds and 1 million records within 10 minutes.
Features At A Glance
- Generalize data to protect sensitive information
- Preserve high-quality structure for classification
- Flexible privacy requirement
- User-specified taxonomy trees
- Dynamically generated taxonomy trees for continuous attributes
- Efficient and scalable
- Anytime solution
- Output data is ready for building a C4.5 classifier
System Requirements
- Processor: Pentium II or better
- RAM: 128MB
- HDD: 10MB free disk space
- OS: Windows 9x/ME/NT/XP
Download TDS 1.0
TDS 1.0 is now available for trial download from the following link: http://ddm.cs.sfu.ca/dmsoft/download.html
The Adult data set can also be obtained from: