Achieving Optimal K-Anonymity Parameters for Big Data

Mohammed Essa Al-Zobbi, Mr.
Seyed Shahrestani, Dr.
Chun Ruan, Dr.


Access Control, Anonymization, k-anonymity, Big Data, MapReduce


Datasets containing private and sensitive information are useful for data analytics. Data owners release such data cautiously, using privacy-preserving publishing techniques. The possibility of personal re-identification is much greater than ever before; social media, for instance, has dramatically increased exposure to privacy violations. One well-known technique, k-anonymity, protects against such exposure. K-anonymity seeks groups of k records that are equivalent on a chosen set of attributes, known as quasi-identifiers. This approach may reduce the risk of personal re-identification, but it may also lessen the usefulness of the information gained. The value of k should therefore be chosen carefully, to balance security against information gained. Unfortunately, there is no standard procedure for defining the value of k, and the problem of optimal k-anonymization is NP-hard. In this paper, we propose a greedy heuristic approach that provides an optimal value for k. The approach evaluates the empirical risk with respect to our Sensitivity-Based Anonymization method. It is derived from fine-grained access and business-role anonymization for big data, which forms our framework.
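
To make the trade-off between k and information loss concrete, the following is a minimal sketch of how the k of a table can be measured and how generalization raises it. It is an illustration only, not the Sensitivity-Based Anonymization method described in the paper; the helper names `k_of` and `generalize_age`, the sample records, and the choice of age and zip as quasi-identifiers are all assumptions made for this example.

```python
from collections import Counter

def k_of(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous.

    Records are grouped by their quasi-identifier values; the smallest
    group size is the k the table satisfies. An empty table yields 0.
    """
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0

def generalize_age(record, width):
    """Replace an exact age with a range of `width` years (hypothetical helper)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Hypothetical sample data; "diagnosis" plays the role of the sensitive attribute.
records = [
    {"age": 23, "zip": "2145", "diagnosis": "flu"},
    {"age": 27, "zip": "2145", "diagnosis": "asthma"},
    {"age": 31, "zip": "2145", "diagnosis": "flu"},
    {"age": 38, "zip": "2145", "diagnosis": "diabetes"},
]
quasi_identifiers = ["age", "zip"]

# Every exact age is unique, so each record forms its own group.
raw_k = k_of(records, quasi_identifiers)

# Ten-year age ranges merge the records into two groups of two.
generalized = [generalize_age(r, 10) for r in records]
generalized_k = k_of(generalized, quasi_identifiers)

print(raw_k, generalized_k)
```

In this sketch, coarser age ranges increase k at the cost of precision in the released data, which is the tension that the choice of k has to resolve.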


Al-Zobbi, M., Shahrestani, S., & Ruan, C. (2016). Sensitivity-based anonymization of big data. Paper presented at the Local Computer Networks Workshops (LCN Workshops), 2016 IEEE 41st Conference on.

Basu, A., Nakamura, T., Hidano, S., & Kiyomoto, S. (2015). k-anonymity: Risks and the Reality. Paper presented at the Trustcom/BigDataSE/ISPA, 2015 IEEE.

Bayardo, R. J., & Agrawal, R. (2005). Data privacy through optimal k-anonymization (pp. 217-228). USA.

Daries, J. P., Reich, J., Waldo, J., Young, E. M., Whittinghill, J., Ho, A. D., . . . Chuang, I. (2014). Privacy, anonymity, and big data in the social sciences. Communications of the ACM, 57(9), 56-63.

Fung, B. C. M., Wang, K., & Yu, P. S. (2007). Anonymizing classification data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering, 19(5). doi: 10.1109/TKDE.2007.1015

Guller, M. (2015). Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis. Berkeley, CA: Apress.

Hariharan, R., Mahesh, C., Prasenna, P., & Kumar, R. V. (2016). Enhancing privacy preservation in data mining using cluster based greedy method in hierarchical approach. Indian Journal of Science and Technology, 9(3).

National Cancer Institute. (2013). Accessing the 1973-2013 SEER Data. from

Kabir, E., Mahmood, A., Wang, H., & Mustafa, A. (2015). Microaggregation sorting framework for k-anonymity statistical disclosure control in cloud computing. IEEE Transactions on Cloud Computing.

Lu, R., Zhu, H., Liu, X., Liu, J., & Shao, J. (2014). Toward efficient and privacy-preserving computing in big data era. IEEE Network, 28(4), 46-50. doi: 10.1109/MNET.2014.6863131

Meyerson, A., & Williams, R. (2004). On the complexity of optimal K-anonymity. In C. Beeri (Ed.), PODS '04 (pp. 223-228): ACM.

Morgenstern, M. (1987). Security and inference in multilevel database and knowledge-base systems (Vol. 16): ACM.

Motwani, R., & Xu, Y. (2007). Efficient algorithms for masking and finding quasi-identifiers. Paper presented at the Proceedings of the Conference on Very Large Data Bases (VLDB).

Park, H., & Shim, K. (2007). Approximate algorithms for k-anonymity. Paper presented at the Proceedings of the 2007 ACM SIGMOD international conference on Management of data.

Smith, M., Szongott, C., Henne, B., & Von Voigt, G. (2012). Big data privacy issues in public social media. Paper presented at the Digital Ecosystems Technologies (DEST), 2012 6th IEEE International Conference on.

Su, T. A., & Ozsoyoglu, G. (1991). Controlling FD and MVD Inferences in Multilevel Relational Database Systems. IEEE Transactions on Knowledge and Data Engineering, 3(4), 474-485. doi: 10.1109/69.109108

Sweeney, L. (2002). Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 571-588. doi: 10.1142/S021848850200165X

Yu, S. (2016). Big privacy: Challenges and opportunities of privacy study in the age of big data. IEEE Access, 4, 2751-2763.