Predicting disclosure risk in the digital database

An organization's internal data can grow rapidly over time. To reduce costs, the organization may choose a third-party storage provider to store all of its data, but if the provider cannot be trusted, there is a risk of data leakage. In another scenario, a dealership collects all of its transaction data and publishes it to a data-analytics company for marketing purposes; if the company is malicious, privacy can be compromised. For these reasons, preserving confidentiality in databases has become an important issue. This article concerns the risk of prediction disclosure in the digital database. We present an efficient noise-generation method that leverages the Huffman coding algorithm, and we construct a noise matrix that adds noise to the original values in a straightforward way. Furthermore, we apply a clustering technique before generating noise. The results show that the noise-generation running time of the clustering scheme is faster than that of the dissociation scheme.

Technology brings convenience, and cloud computing has grown in recent years. An organization's internal data can grow quickly. Even if the organization builds its own storage space, it may still publish its data to a data-analytics company for marketing purposes. Data-mining techniques therefore play an important role in knowledge discovery in databases (KDD). However, a malicious data-analytics company can record personal data when the organization publishes its statistical database for the company's benefit; if the company is not trusted, there is a leakage risk. These concerns have made privacy research increasingly popular in recent years. Statistical databases (SDBs) are used to produce statistical aggregates, such as sum, average, maximum, and minimum.
The results of statistical aggregates do not reveal the contents of any single individual tuple. However, a user can pose many legitimate queries and infer confidential information from the answers obtained from the database. In recent years, improving the security of statistical databases has received much attention. The security problem in a traditional statistical database involves three roles [17]: the statistician, whose interest is in obtaining aggregated data; the data owner, who wants individual records to be secure; and the database administrator, who must satisfy both of the above. Privacy challenges in statistical databases fall into two aspects [15]: the data owner must prevent data theft by hackers, prevent data abuse by the service provider, and restrict users' access rights; the user wants the query content hidden and the database not to reveal query details. Many approaches have been proposed. Navarro-Arribas and Torra organize them into four categories [16]: 1) perturbative methods, which modify the original data to achieve a certain degree of confidentiality and are generally called noise; 2) non-perturbative methods, which hide the data without introducing error, so that, unlike perturbative methods, the data are not distorted; 3) cryptographic methods, which use a classical cryptographic system; and 4) synthetic data generation, which generates random data while maintaining a relationship with the original data. To protect confidential information in a database, Statistical Disclosure Control (SDC) is the most widely used family of privacy-preserving solutions for statistical data. Micro-aggregation techniques (MAT) belong to the SDC family and to the perturbative methods. Micro-aggregation has many attractive features, including robust performance, consistent responses, and ease of implementation [6].
A user can still obtain useful information, because this method does not reduce the information contained in the content; in other words, it results in minimal information loss. Additionally, we review several privacy-preserving approaches [1-5,8,12-14,17]. In particular, the micro-aggregation scheme has attracted interest for statistical databases in recent years, because it replaces the original values, with low distortion, to prevent identity and prediction disclosure, and the replaced data remain usable for data-analysis and data-mining applications. Every record in the database can be represented as a data point in a coordinate system. This article considers that a combination of two or more non-confidential attributes, such as age and weight, can be used to link to an individual; such a set of attributes is collectively called a quasi-identifier. A popular approach to replacing the original data is a clustering-based technique that prevents identity disclosure: the adversary may be confused when the original data are replaced by cluster representatives. However, although the clustering-based technique makes the data in the dataset homogeneous, a problem of prediction disclosure remains.

2. Proposed scheme.

This paper concerns the prediction-disclosure problem that arises when the quasi-identifier is generalized by a homogeneous micro-aggregation method. A quasi-identifier consists of one or more attributes that can be linked to an individual; for simplicity, we consider only a two-attribute quasi-identifier. First, all quasi-identifier values are converted to data points in the coordinate system. To address prediction disclosure, the homogeneous values produced by the original micro-aggregation method are clustered first. We then generate noise based on the centroids of these groups. To improve noise-injection speed, all noise values are collected into a set, called the noise matrix in this paper, in which each original value corresponds to one noise value.
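As a rough illustration of the noise-matrix idea, the sketch below pairs each original quasi-identifier point with a precomputed noise value derived from its cluster centroid. The paper's Huffman-based noise generator is not detailed in this section, so Gaussian noise is substituted here purely as a placeholder; all function names and parameters are our own assumptions, not the paper's.

```python
import random

def build_noise_matrix(clusters, scale=1.0):
    """Build the 'noise matrix': a lookup that maps each original
    quasi-identifier point to one precomputed noise value.
    The noise is drawn around the point's offset from its cluster
    centroid (placeholder for the paper's Huffman-based generator)."""
    noise_matrix = {}
    for cluster in clusters:
        # centroid of the cluster (two-attribute records)
        cx = sum(p[0] for p in cluster) / len(cluster)
        cy = sum(p[1] for p in cluster) / len(cluster)
        for p in cluster:
            noise_matrix[p] = (random.gauss(cx - p[0], scale),
                               random.gauss(cy - p[1], scale))
    return noise_matrix

def inject_noise(points, noise_matrix):
    """Noise injection: add the precomputed noise to each original value."""
    return [(x + nx, y + ny)
            for (x, y), (nx, ny) in ((p, noise_matrix[p]) for p in points)]
```

Because every noise value is precomputed, injection itself is a single lookup and addition per record, which is where the claimed speed-up would come from.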
In this section, we introduce the concept of micro-aggregation and then illustrate the Prim's-MST-based clustering technique. The main ideas of the paper, the noise-generation and noise-injection procedures, are described in the rest of this section.

2.1 Preliminary.

The micro-aggregation technique belongs to the family of statistical disclosure control and is applied to numerical data, categorical data, sequences, and heterogeneous data [16]. It computes a value to represent a group and replaces the original values with it to confuse the adversary. Each record forms a group with its k-1 closest records, where k is a constant threshold predefined by the data protector. The higher k is, the higher the degree of confidentiality, but the lower the data quality; conversely, if k is lower, the privacy level is lower but the data quality is higher. This is a trade-off between the risk of data disclosure and the loss of information. Although this method perturbs the original data and causes some distortion, it keeps the level of distortion low and does not affect the operation of the database. Minimizing information loss is therefore a major challenge of this method. There are two main micro-aggregation operations, partition and aggregation, which we describe as follows. Partition: the records are partitioned into several disjoint groups, each containing at least k records. Aggregation: each record in a group is replaced by the group centroid, a value computed to represent the group.

2.2 MST clustering.

We adopt the Prim's minimum-cost-spanning-tree clustering technique proposed by Laszlo and Mukherjee in 2005 [11]. In the first step, the clustering technique constructs Prim's minimum-cost spanning tree over all records in the dataset. Prim's algorithm is a greedy algorithm that finds a minimum-cost spanning tree for a connected, undirected, weighted graph.
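The partition and aggregation operations described above can be sketched briefly. Assuming two-attribute numeric records that have already been partitioned into disjoint groups of at least k records, the aggregation step replaces every record with its group centroid; the function name and the value k=2 below are our own illustrative choices.

```python
def microaggregate(groups, k=2):
    """Aggregation step of micro-aggregation: replace each record
    in a group by the group centroid (two-attribute records)."""
    assert all(len(g) >= k for g in groups), "each group needs >= k records"
    out = []
    for g in groups:
        # centroid: per-attribute mean of the group
        cx = sum(x for x, _ in g) / len(g)
        cy = sum(y for _, y in g) / len(g)
        out.extend([(cx, cy)] * len(g))
    return out
```

For example, the two groups `[(0,0),(2,2)]` and `[(4,4),(6,6)]` are both collapsed onto their centroids, `(1,1)` and `(5,5)`, so records inside a group become indistinguishable.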
It finds a subset of the edges that forms a minimum-cost spanning tree connecting all nodes, such that the total weight of all edges is minimized. Some notation is defined to facilitate the discussion. Each record, with its attributes, in dataset D can be converted to a data point in the coordinate system and is considered a node u of the minimum-cost spanning tree. A node u can be connected to another node v in dataset D, forming an edge e(u,v), u, v ∈ D. For every edge, a value can be computed from its two endpoint nodes, and this value is used as the weight w of the edge. Prim's algorithm first selects a single node u ∈ D and initializes a minimum-cost spanning tree F = {u} with no edges. The next step selects another node v ∈ D \ F, where v is the node closest to the set F. A new edge e(u,v) is formed by the two nodes u, v ∈ D; node v points to its parent node u, and v is added to the set F, so F = {u, v}. Each node points to its parent node in the tree, except the initial node, which points to null; in this case, node u points to null. The process iterates until F = D. Prim's algorithm thus grows the minimum-cost spanning tree from a single node, considered the root of the tree, and the total weight of all selected edges is minimized. The result of Prim's MST algorithm is shown in Figure 1, where the tree nodes are connected by red lines and the weight is shown next to each edge. In the second step, to partition the nodes of the MST into clusters, we must decide how many edges of the MST to remove. The idea is to visit the edges of the MST from longest to shortest and decide which edges to cut while retaining the remaining edges. After edge cutting, the MST splits into several subtrees, each of which forms a cluster. All edges are placed in a priority queue in descending order of weight, and edges are then taken from the queue in sequence.
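The two steps above, growing Prim's MST over the data points and then cutting the longest edges so the remaining subtrees form clusters, might be sketched as follows. This is a minimal illustration assuming 2-D points with Euclidean edge weights on the complete graph; the simple "cut the longest edges" rule and all helper names are our own, not the criterion from [11].

```python
import heapq
import math

def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph of 2-D points.
    Returns the MST as a list of edges (weight, u, v) over point indices."""
    n = len(points)
    dist = lambda a, b: math.hypot(points[a][0] - points[b][0],
                                   points[a][1] - points[b][1])
    in_tree = {0}                      # start from node 0 as the root
    edges = []
    candidates = [(dist(0, v), 0, v) for v in range(1, n)]
    heapq.heapify(candidates)
    while len(in_tree) < n:
        w, u, v = heapq.heappop(candidates)
        if v in in_tree:               # stale entry, node already added
            continue
        in_tree.add(v)                 # v's parent in the tree is u
        edges.append((w, u, v))
        for x in range(n):
            if x not in in_tree:
                heapq.heappush(candidates, (dist(v, x), v, x))
    return edges

def cut_clusters(points, edges, num_clusters):
    """Cut the longest MST edges; each remaining subtree is a cluster."""
    keep = sorted(edges)[:len(edges) - (num_clusters - 1)]
    parent = list(range(len(points)))  # union-find over the kept edges
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for _, u, v in keep:
        parent[find(u)] = find(v)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

With two tight pairs of points placed far apart, cutting the single longest MST edge recovers the two pairs as clusters, which matches the intuition of the edge-cutting step.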