Big Data: Preserving Privacy by Design
Information systems in the health sector have changed significantly, making it possible to collect, store, and process huge amounts of digital records that may hold the key to future population health breakthroughs. However, linking records across diverse data systems is complicated by privacy concerns and by data that are recorded differently (10/12/58 or Oct. 12, 1958), change over time (maiden vs. married last names), contain errors (transposed dates during data entry), or are simply missing. So how do we use this new wealth of data, commonly termed “big data,” in a way that both maintains privacy and ensures accuracy?
Hye-Chung Kum, Ph.D., associate professor at the Texas A&M Health Science Center School of Rural Public Health, thinks the answer lies in developing new methodologies for extracting information and outlines her framework in the Journal of the American Medical Informatics Association.
In “Privacy Preserving Interactive Record Linkage (PPIRL),” Kum emphasizes that it is critical to understand the distinction between identity disclosure (e.g., who the person is) and sensitive attribute disclosure (e.g., does this person have cancer). She maintains that identity disclosure on its own has little potential for harm; it is the disclosure of sensitive attributes that causes harm.
“Privacy preserving data integration is key to any data intensive population research,” states Kum. “If we define the privacy goal of privacy preserving record linkage as a guarantee against attribute disclosure, we can develop systems that allow both privacy protection and high quality record linkage.”
Human-based third-party linkage centers have been the accepted norm internationally to date. However, Kum believes a more flexible computerized third-party linkage platform, Secure Decoupled Linkage (SDLink), should be considered. SDLink rests on three core privacy principles. First, it separates the identifying data from the sensitive data using encryption. Second, through chaffing (adding fake data) and changing the label of the data set, it prevents the attribute inference that can occur through group disclosure. For example, if you know that someone is on a cancer registry (group disclosure), you can infer that they have cancer. This attribute disclosure is avoided if you know that the list contains fake data (people who do not have cancer are also on the list) or if you do not know that the list is a cancer registry. Finally, identity disclosure is minimized by recoding variables (e.g., gender is recoded as different, same, or missing rather than as male or female). As a result, only the information that is essential for record linkage is revealed.
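To make the chaffing and recoding ideas more concrete, here is a minimal Python sketch. It is an illustration only, not SDLink's actual implementation: the function names recode_comparison and add_chaff, the record layout, and the normalization rules (trimming whitespace, ignoring case) are all assumptions made for this example.

```python
import random
from typing import List, Optional


def _normalize(value: Optional[str]) -> Optional[str]:
    """Trim and lowercase a value; treat empty strings as missing."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def recode_comparison(a: Optional[str], b: Optional[str]) -> str:
    """Hypothetical recoding step: report only whether two field values
    agree ("same", "different", "missing"), never the values themselves."""
    a_norm, b_norm = _normalize(a), _normalize(b)
    if a_norm is None or b_norm is None:
        return "missing"
    return "same" if a_norm == b_norm else "different"


def add_chaff(real: List[dict], fake: List[dict], seed: int = 0) -> List[dict]:
    """Hypothetical chaffing step: mix fake records into the real list and
    shuffle, so being on the list no longer implies the sensitive attribute."""
    mixed = list(real) + list(fake)
    random.Random(seed).shuffle(mixed)
    return mixed


# Usage: the person reviewing a candidate match sees only the recoded result.
print(recode_comparison("F", "f"))   # same
print(recode_comparison("M", "F"))   # different
print(recode_comparison(None, "F"))  # missing
```

In this sketch a reviewer can still judge whether two records agree on gender without ever seeing the gender itself, which is the sense in which only the information essential for linkage is revealed.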
“Never before in history have we had more data to use for population research,” says Kum. “But to use big data appropriately, we must understand the 4V’s – lots of (volume), complex (variety), data that are continuously generated (velocity), and also have much uncertainty (veracity).”
Kum believes variety and veracity are the two main challenges for big data in health services research, and the key is “to understand the minimum information required for accurate linkage and then to design protocols to reveal, in a secure manner, only that information.”