Engineering Sketch

Project Description (Semesters 1 + 2)

Stackelberg games model how groups can defend against attackers by allocating their resources to minimize the information that attackers can steal. The principles from these games can be applied to minimizing the amount of data lost from a database during successive cyberattacks.

Background: Stackelberg Games

Stackelberg security games involve two players: an attacker and a defender. The defender’s move is to protect several vulnerable resources in a way that minimizes the utility of the attacker. The attacker’s move is to attempt to steal the vulnerable resources in a way that maximizes their utility. The defender always makes the first move, and the attacker can see how the defender allocates resources. Both the attacker and defender have utility functions which describe how important each vulnerable resource is to them. In the example in Table 1, the attacker’s optimal strategy is to attack Resource A.

[Table 1: example attacker and defender utility values for each vulnerable resource]

In practice, the utility function of the attacker is not known, but information about the attacker’s utility function can be deduced. In the example above, the defender can adjust its coverage probability in order to give a higher chance of protecting Resource A in the next attack. Thus, the defender can optimize their strategy against future attacks.
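
As a rough illustration of this loop, the sketch below uses hypothetical utility values and a crude coverage-adjustment rule (Table 1’s actual numbers are not reproduced here); it is only a minimal example of the attacker’s best response and the defender’s reaction, not the project’s solver.

    # Minimal sketch: attacker best response and a simple defender coverage
    # adjustment. The utility values and the adjustment rule are hypothetical.
    attacker_utility = {"A": 10, "B": 5, "C": 2}     # assumed values, not Table 1's
    coverage = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}  # defender's coverage probabilities

    def attacker_best_response(utility, coverage):
        # The attacker targets the resource with the highest expected payoff,
        # i.e. its utility discounted by the chance the defender is covering it.
        return max(utility, key=lambda r: utility[r] * (1 - coverage[r]))

    target = attacker_best_response(attacker_utility, coverage)
    print("Attacker targets:", target)               # -> "A" under these numbers

    # After observing the attack, the defender shifts coverage toward the attacked
    # resource (a crude stand-in for re-solving the game against future attacks).
    shift = 0.1
    for r in coverage:
        coverage[r] += shift if r == target else -shift / 2
    print("Updated coverage:", coverage)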

Building a Game for Databases

Individual cells of a database can be modeled as the vulnerable resources in a Stackelberg security game. Assume that an attacker snooping on a database can observe both the query that is run and the response to that query, that the attacker cannot directly observe the contents of memory, and that the defender knows when information has been leaked.


The attacker chooses a target segment of the database to fix in memory in order to observe it. When a query is run, the defender must observe which information has been leaked and update their belief about the attacker’s utility function. From there, the defender can optimize their strategy against the updated utility function.
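
A minimal sketch of such a belief update appears below. The candidate utility profiles and the reweighting rule are illustrative assumptions, not the project’s actual algorithm.

    # Sketch: maintain a belief over candidate attacker utility profiles and
    # upweight the profiles that best explain which data was just attacked.
    candidate_profiles = {
        "values_names":    {"name": 8, "status": 2},   # hypothetical profiles
        "values_statuses": {"name": 2, "status": 8},
    }
    belief = {p: 0.5 for p in candidate_profiles}      # uniform prior

    def update_belief(belief, observed_target):
        # Double the weight of profiles under which the observed target was the
        # most valuable item, then renormalize to keep a probability distribution.
        for p, utility in candidate_profiles.items():
            if max(utility, key=utility.get) == observed_target:
                belief[p] *= 2.0
        total = sum(belief.values())
        return {p: w / total for p, w in belief.items()}

    belief = update_belief(belief, observed_target="status")
    print(belief)   # the belief now favors the profile that values test status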
 

The vulnerability of each cell can be modeled with a vulnerability score. Whenever information is leaked, the total score of the leaked information increases. The defender can define thresholds for the total leaked score and specify which actions to take at each threshold. Beyond some final threshold, the entire database must be flushed and re-encrypted.
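
The sketch below shows one way such thresholds could be wired up. The numeric limits and the intermediate action names are placeholders for whatever the defender defines; only the final flush-and-re-encrypt step comes from the description above.

    # Sketch: escalating defender actions keyed to the total leaked score.
    ACTIONS = [
        (0,   "no action"),
        (50,  "increase coverage of exposed rows"),          # placeholder action
        (200, "re-encrypt the exposed table"),                # placeholder action
        (500, "flush and re-encrypt the entire database"),    # final threshold
    ]

    def action_for(total_leaked_score):
        # Return the action attached to the highest threshold crossed so far.
        chosen = ACTIONS[0][1]
        for limit, action in ACTIONS:
            if total_leaked_score >= limit:
                chosen = action
        return chosen

    print(action_for(120))   # "increase coverage of exposed rows"
    print(action_for(650))   # "flush and re-encrypt the entire database"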

Quantifying Vulnerability

Let V be the vulnerability score, where a indicates whether or not the cell was attacked, c is the number of cells attacked per row, d is the distinguishing factor, and N is the cardinality factor. The distinguishing factor is the number of rows that share the same identifiers as that row (an identifier is an attribute such as name). The cardinality factor is 100 if the cardinality of the data (the number of possible values) is 2, and 1 in all other cases.

[Equation: vulnerability score V in terms of a, c, d, and N]

If a cell has been attacked, it no longer needs to be protected, since its information is already lost. The cardinality factor is included because, when the data is binary, learning one value can give information about the rest of the table; this is not the case for higher cardinalities (see Defender Utility Function for more explanation). The score rises as the number of cells stolen per row increases, and falls if there are multiple entries with the same identifying information.
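
Because the formula itself is not reproduced on this page, the sketch below assumes the factors combine multiplicatively; it is consistent with the behavior just described but is only a guess at the exact form.

    # Sketch of a vulnerability score consistent with the description above.
    def vulnerability_score(attacked, cells_attacked_in_row, distinguishing, cardinality):
        a = 0 if attacked else 1            # an attacked cell no longer needs protection
        c = cells_attacked_in_row           # more cells stolen from the row -> more exposed
        d = max(distinguishing, 1)          # rows sharing the same identifiers
        N = 100 if cardinality == 2 else 1  # cardinality factor from the text
        return a * c * N / d

    # A binary-valued cell in a row where 3 cells were stolen and 2 rows share
    # its identifiers:
    print(vulnerability_score(False, 3, 2, 2))   # 150.0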

Defender Utility Function

The utility function of a defender can be constructed based on the cardinality of the data. The example in Table 2 is a database of medical patients who have been tested for a rare disease. The test status has a cardinality (number of possible values) of 2, so leaking status information can reveal a lot about the database.

[Table 2: example database of patients tested for a rare disease]

Example Query 1: Suppose a query is run for the last names of all of the patients who tested positive for the disease, and that list of people is returned. Not only does the attacker know who has tested positive, but they also know that everyone else has tested negative, so information about every member of the database has been leaked.


Example Query 2: If a query is run for all patients with last name “Doe,” only two first names will be returned. Even though an equal number of cells were leaked in both queries, this second query leaks no information about Bob Smith because names have a much higher cardinality than test status.
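
The sketch below walks through both example queries on a tiny hypothetical table (Table 2’s actual contents are not reproduced here) to show how many rows each query says something about.

    # Sketch: the two example queries over an illustrative patient table.
    patients = [
        ("John",  "Doe",   "Positive"),
        ("Jane",  "Doe",   "Negative"),
        ("Bob",   "Smith", "Negative"),
        ("Alice", "Smith", "Positive"),
    ]

    # Query 1: last names of everyone who tested positive. Because test status is
    # binary, every row not returned is implicitly revealed as "Negative", so the
    # query says something about all rows.
    positives = [last for _, last, status in patients if status == "Positive"]
    rows_touched_q1 = len(patients)

    # Query 2: first names of everyone with last name "Doe". Last names have high
    # cardinality, so non-matching rows (e.g. Bob Smith) reveal almost nothing.
    does = [first for first, last, _ in patients if last == "Doe"]
    rows_touched_q2 = len(does)

    print(positives, rows_touched_q1)   # ['Doe', 'Smith'] 4
    print(does, rows_touched_q2)        # ['John', 'Jane'] 2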

Attacker Utility Function

The attacker’s goal is to build a complete picture of members of the database, so they will prefer to attack the same row repeatedly in order to accumulate new information about that row.


For example, if the attacker learns that people with the last name “Doe” and “Smith” have tested positive from Example Query 1 above, they will prefer to learn the first names of these patients. Since the attacker did not learn the last name of patients who have tested negative, they gain less utility from attacking the rows corresponding to those patients.

Next Steps

Formally define the game, specify the intermediate actions that the defender can take, devise a strategy update algorithm, implement it, and test it.

Project Description (Semester 3)

To study a simpler algorithm as a proof of concept, use a probabilistic model to quantify data leakage, and take intermediate actions to reduce the resulting risk score. This approach is applicable to in-memory databases.

Background

A side-channel attack is a form of cyberattack in which a malicious observer is able to use the pattern of memory requests to gain information about the program issuing the requests and the data being transferred, all without directly observing the data or the execution of the program. An attacker can accomplish such an attack by monitoring traffic traveling across the address bus between the processor and DRAM. Database applications access memory frequently, as only part of the database can be kept in the cache at any particular time. Under this kind of attack, every request made to the database may leak some information to the attacker.


If only the address bus is compromised, it is possible to prevent the attacker from gaining information by shuffling all or part of the database, so addresses may no longer correspond to the same data from request to request. In order to perfectly disguise the memory traffic, a shuffle must be performed on every access; however, any shuffle is a high-cost operation that adds to request latencies, so performing a shuffle that often is not practical. One way to overcome this hurdle is to tolerate some data leakage in order to improve performance.
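
The sketch below illustrates the shuffling idea with a toy logical-to-physical mapping; the data structure is illustrative and not the project’s implementation.

    # Sketch: after a shuffle, a given record no longer lives at the same physical
    # slot, so previously observed addresses go stale.
    import random

    records = ["rec0", "rec1", "rec2", "rec3"]
    position = {i: i for i in range(len(records))}   # logical index -> physical slot

    def shuffle_layout():
        # Re-draw the logical-to-physical mapping, simulating a data shuffle.
        slots = list(range(len(records)))
        random.shuffle(slots)
        for logical, physical in enumerate(slots):
            position[logical] = physical

    def read(logical_index):
        # physical is the slot an attacker on the address bus would observe
        # (simulated here; this toy keeps the data itself indexed logically).
        physical = position[logical_index]
        return records[logical_index], physical

    print(read(1))       # ('rec1', 1) before shuffling
    shuffle_layout()
    print(read(1))       # same record, but typically a different physical slot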


This project assumes the worst-case scenario, in which the attacker may observe the addresses from all queries issued to the database. It also assumes that the attacker’s goal is to build a complete picture of the records in the database. Since the attacker must distinguish individual records from a series of overlapping queries, the more unique a record is, the more likely it is to be distinguished by any particular query. This model applies particularly to situations like medical records databases.

Proposed Solution

Quantify the data leakage incurred by each query as a per-query risk score. Calculate the risk score as a function of the estimated probability of observing the specific combination of data highlighted by the request-response pair.


Do this by storing, for each column in the primary data table, a metadata table that keeps track of every unique value in the column and its relative frequency. Assuming that the columns are roughly independently distributed, the proportion of records matching the data highlighted by the request-response pair is approximately the product of the relative frequencies of each cell highlighted in the record. The lower this product, the rarer the combination of these particular values, and therefore the greater the severity of the leakage.
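
A minimal sketch of this calculation appears below. The per-column frequency tables follow the description above; using the reciprocal of the matching proportion as the risk score is an assumed choice for the "function of the estimated probability."

    # Sketch: per-query risk score from per-column relative frequencies.
    from collections import Counter

    rows = [
        {"last": "Doe",   "status": "Positive"},    # illustrative data
        {"last": "Doe",   "status": "Negative"},
        {"last": "Smith", "status": "Negative"},
        {"last": "Jones", "status": "Negative"},
    ]

    # Metadata: for every column, the relative frequency of each unique value.
    frequencies = {
        col: {v: n / len(rows) for v, n in Counter(r[col] for r in rows).items()}
        for col in rows[0]
    }

    def risk_score(highlighted):
        # highlighted maps column -> value revealed by a request-response pair.
        proportion = 1.0
        for col, value in highlighted.items():
            proportion *= frequencies[col][value]   # columns treated as independent
        return 1.0 / proportion                     # rarer combinations score higher

    print(risk_score({"status": "Negative"}))                    # common -> low score
    print(risk_score({"last": "Smith", "status": "Negative"}))   # rarer -> higher score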


Keep a running total of the risk scores of all handled queries. Allow the database owner to define a series of intermediate thresholds at which to shuffle all or part of the database. Each shuffle decreases the running total by an amount based on the risk scores associated with the shuffled data. Larger thresholds should call for larger shuffles.
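
The sketch below ties these pieces together. The threshold values and the credit applied after each shuffle are assumptions; the project leaves both for the database owner to tune.

    # Sketch: accumulate per-query risk and shuffle when a threshold is crossed.
    running_total = 0.0
    THRESHOLDS = [(100.0, "partial"), (400.0, "full")]   # ascending, assumed values

    def handle_query(score):
        global running_total
        running_total += score
        action = None
        for limit, kind in THRESHOLDS:
            if running_total >= limit:
                action = kind                            # strongest threshold crossed wins
        if action == "partial":
            running_total -= 100.0   # credit for risk tied to the shuffled region (assumed)
        elif action == "full":
            running_total = 0.0      # a full shuffle clears the accumulated risk
        return action

    for s in [60, 70, 50]:
        print(handle_query(s), round(running_total, 1))   # None 60.0 / partial 30.0 / None 80.0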

Methods

An example dataset was created to simulate a simple medical records table for 50 patients. Two rows of this table are shown in Table 1.

[Table 1: two example rows from the simulated medical records table]

Thresholds and their corresponding intermediate actions were set as shown below in Table 2:

[Table 2: risk-score thresholds and their corresponding intermediate actions]

Three risk metrics were evaluated on a single benchmark workload and compared based on the total amount of data leakage and the number of times the shuffle operations were called. The three metrics are defined below:

[Figure: definitions of the three evaluated risk metrics]


Table 3 shows the number of calls to each shuffle method for each metric, and the maximum observed risk score value as a proxy for total data leakage.

[Table 3: number of calls to each shuffle method and maximum observed risk score for each metric]

Conclusions

The second metric seems to be the least efficient: it introduced the most latency, since it made the most calls to the full shuffle method, and it also yielded the highest maximum risk score. The last metric, while leaking the least amount of data and never calling the full shuffle method, made 247 shuffle calls overall, so the total latency for the benchmark would have been far too high. The first metric seems to be the best, yielding an intermediate amount of overall data leakage, few calls to the full shuffle method, and a total of 191 shuffle calls.


Project Details

Advisor: Dr. Benjamin C. Lee

benjamin.lee@duke.edu

Timeline: August 28th, 2017 - May 1st, 2018 (Semesters 1 + 2)

                         August 20th, 2018 - December 15th, 2018 (Semester 3)

Hours per week: 10 hours/week

Time to completion: 285 hours (Semesters 1 + 2)

                                 170 hours (Semester 3)

                     455 hours (total)

Connection to GCS Focus

This project connects to "Securing Cyberspace" because it explores possible strategies to prevent unauthorized accesses to private data streams.
