% Secure Nearest Neighbour Search Over Encrypted Data Using Intel SGX
% Published: 1 April 2019; last modified: 23 July 2024. Words: 6,777 (approx).

\documentclass[conference]{IEEEtran}

\usepackage{blindtext, graphicx,caption}

%\usepackage[ruled,norelsize]{algorithm2e}

\usepackage{array,paralist,hyperref,amssymb,amsmath}

\usepackage{subcaption}

\usepackage[utf8]{inputenc}

\usepackage{pgfplots}

\pgfplotsset{width=8.2cm,height=2.35in,compat=1.8}

\usepackage{algorithm}

%\usepackage{algpseudocode}

\usepackage{algorithmic}

\usepackage{pifont}

\usepackage{lipsum}

\usepackage{amsthm}


\usepackage{etoolbox}

\ifCLASSINFOpdf

  % \usepackage[pdftex]{graphicx}

  % declare the path(s) where your graphic files are

 %\usepackage[ruled]{algorithm2e}

\else

  % or other class option (dvipsone, dvipdf, if not using dvips). graphicx

  % will default to the driver specified in the system graphics.cfg if no

 

\fi

% correct bad hyphenation here

\hyphenation{op-tical net-works semi-conduc-tor}

\begin{document}

%

% paper title

\title{Secure Nearest Neighbour Search over Encrypted Data using Intel SGX}

%Privacy Preserving Image-centric Social Discovery using Deep Neural Network and

% author names and affiliations

% use a multiple column layout for up to three different

% affiliations

\author{\IEEEauthorblockN{Kazi Wasif Ahmed}

\IEEEauthorblockA{Computer Science\\

University of Manitoba\\

Winnipeg, Canada\\

Email: wasif@cs.umanitoba.ca}

\and

\IEEEauthorblockN{Md. Momin Al Aziz}

\IEEEauthorblockA{Computer Science\\

University of Manitoba\\

Winnipeg, Canada\\

Email: azizmma@cs.umanitoba.ca}

\and

\IEEEauthorblockN{Noman Mohammed}

\IEEEauthorblockA{Computer Science\\

University of Manitoba\\

Winnipeg, Canada\\

Email: noman@cs.umanitoba.ca}

}

% make the title area

\maketitle

\begin{abstract}

%\boldmath

% \blindtext[1]

%Image sharing is one of the most popular features facilitated by different social media sites such as Facebook, Flickr, Pinterest and Instagram. People frequently use these social media sites to express different aspects of their life with peers they are connected through these sites.

The attractive features of cloud platforms, such as low cost, high availability, and scalability, are encouraging social, health, and other service providers to store their client data in the cloud. Though cloud-based services offer many advantages, the privacy and security of this sensitive user data are a major concern from both the commercial and the user perspective. A compromised cloud server can leak sensitive information about users, as recent incidents (e.g., the iCloud leaks) have shown. One practical solution to mitigate these concerns is to encrypt the data before outsourcing it to the cloud. Although encryption protects the data from unauthorized access, it increases the computational complexity of performing the necessary operations (e.g., similarity or nearest neighbour search) that are the key requirement for many health and social discovery applications. There have been some recent endeavours to enable nearest neighbour search over encrypted data without compromising computational efficiency. However, most of them focus on query accuracy and prove to be computationally intensive for high-dimensional, large datasets. In this paper, we propose an efficient scheme for secure nearest neighbour search over encrypted high-dimensional data with several state-of-the-art metrics. The proposed design utilizes the advantages of the Intel$\textregistered$ Software Guard Extensions (Intel$\textregistered$ SGX) architecture and Locality Sensitive Hashing (LSH) for secure similarity computation and nearest neighbour search.

\end{abstract}

% IEEEtran.cls defaults to using nonbold math in the Abstract.

% in the abstract anyway.

% Note that keywords are not normally used for peerreview papers.

\begin{IEEEkeywords}

Similarity Search; Intel SGX; Nearest Neighbour Search; Secure Search in the Cloud

\end{IEEEkeywords}

\section{Introduction}

%The popularity of online social networks is on the rise facilitating the individuals having interconnection among them.

%The correct reasoning of an user’s mentality and behaviour can be very useful for generating relevant personalized recommendation in social networking sites \cite{cheng2016online}.

%what is user similarity or why we need it?

The advent of ubiquitous computing and the Internet of Things (IoT) has given the computer science community a vast amount of data. This data is shared every day through health, social networking, and other online applications. This massive volume of data demands an equivalent amount of storage space, which has driven the popularity of the cloud architecture. Cloud computing is thus becoming prevalent by removing the burden of large-scale data centre management in a cost-effective manner. At the same time, storing sensitive data in untrusted cloud servers raises privacy implications for the individuals contributing to the service. To minimise the disclosure risk, the data should be secured or go through a privacy-preserving mechanism before being outsourced to the cloud. Although encrypting the data solves part of the privacy leakage problem, it adversely affects the time required to perform arbitrary computation on the data \cite{kuzu2012efficient}.

\par

%operation

One such computation is the nearest neighbour search, one of the vital operations for any social discovery or recommendation platform. The problem consists of a collection of data items that are characterised by some features (age, likes, affiliations, etc.) and a query that has specified values known as the target features. The relevance between the query and the data items is measured by a similarity metric. The goal is to retrieve the most similar data records according to that specific metric and the target features.

%The goal is to retrieve the items closest  or most  similar whose similarity against the specified query is greater than a predetermined threshold under the utilized metric.

However, measuring similarity among millions of records in real time is computationally expensive, and the complexity of searching increases dramatically when the data is encrypted. If we choose to encrypt the data, we need to adopt a cryptographic scheme that guarantees substantial security of the data while still allowing search over the encrypted data.
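For concreteness, the plaintext baseline can be sketched as a brute-force $k$-nearest-neighbour scan. This is a minimal illustration, not the proposed system; the profile vectors and the choice of $k$ are made up for the example:

```python
import heapq
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(records, query, k):
    """Brute-force k-nearest-neighbour search: compare the query against
    every record and keep the k records with the smallest distances."""
    return heapq.nsmallest(k, records, key=lambda r: euclidean(r, query))

# Illustrative binary user profiles (made up for this sketch).
profiles = [(0, 1, 0, 1, 1), (1, 0, 0, 1, 1), (1, 0, 1, 1, 1), (0, 1, 0, 1, 0)]
print(knn(profiles, (0, 1, 0, 1, 1), 2))  # → [(0, 1, 0, 1, 1), (0, 1, 0, 1, 0)]
```

Every query touches every record, which is exactly the cost that becomes prohibitive at scale, and prohibitively worse once every distance must be computed on ciphertexts.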

\subsection{Motivating Scenarios}

The motivation for this work comes from scenarios where nearest neighbour search must be applied on a large volume of data, and therefore performance becomes critical. We describe here two scenarios which require secure similarity search implementations.

\noindent\textbf{Secure Social Discovery:} Social discovery based on similarity search is one of the most prevailing services facilitated by social networking sites such as Facebook, Twitter, LinkedIn, Google+, etc. Moreover, virtual friendships are built upon social similarity, providing users with an opportunity to explore new dimensions of empathy. The similarity among different users can be measured considering parameters like social relationships or graphs. An enormous amount of data in different formats is shared by the users of social networking sites. As an example, the number of images shared every day on Instagram is close to \textit{60 million}~\cite{souza2014not}. This massive volume of data itself demands a massive amount of storage space for the back-end data centres of social media sites.

\begin{figure*}[t]

    \centering

    \captionsetup{justification=centering,margin=1cm}

    \includegraphics[height=2.6in, width=7in]{system_architecture_4}

    \caption{Proposed architecture of the system\label{fig:architecture}}

\end{figure*}

Outsourcing this data directly into the cloud can be considered a potential solution for low-cost storage. Some of the popular social media sites have already adopted this approach. For example, Instagram was using the Amazon cloud before Facebook acquired them in 2014 \cite{instagram}. Unfortunately, incorporating insecure cloud storage (or even in-house storage) for storing users' sensitive data poses a severe security threat. One prime example which underlines this scenario is the recent incident regarding the hacking of the iCloud image storage service and the celebrity photo leakage \cite{lewis2014icloud}. This is why user data should be encrypted before outsourcing to the cloud or data centres, to protect clients' privacy.

%Furthermore, in social discovery applications to search for other similar individuals an user has to enclose his personal preferences to service providers for building user profiles. After that, a nearest neighbour search operation is performed based on his/her user profile and results are recommended by the service providers. User profiles are sensitive as it contains the personal preferences of the users. To minimize the disclosure risk, the data should be encrypted before outsourcing to the cloud. Therefore, the nearest neighbour search operation is required to perform over the encrypted data.

\noindent\textbf{Secure Mobile Sensor based  Health Applications:}

The appealing features of mobile sensor-based health applications are attracting more users than ever before. It is no surprise that both clinicians and patients are using mobile health apps to deliver or receive much-needed care. Moreover, mobile health app adoption has doubled in recent years \cite{Pollack}. To store the huge amount of users' health records, service providers need an extensive amount of storage space. Using cloud storage as a data repository can be considered a feasible solution to this storage constraint problem.

%The health records of a user is more sensitive as it can not be replaced if stolen.

%different dataq could be more interesting –shad

%credit card data, email addresses, social security numbers, employment information and

Health records typically contain personal and family medical history, sensitive disease association, genetic disorders, susceptibility to diseases and current medical conditions. These records are sensitive as most of them will remain valid for years \cite{Zorabedian} and can be linked to a person.

%The newly released Sixth Annual Benchmark Study on Privacy and Security of Healthcare Data, conducted by Ponemon Institute, found that mobile device insecurity is a top security threat that worries health care organizations \cite{Arevalo}. The recent complaints against Fitbit, Jawbone, and other fitness wristbands companies illustrates the threats. None of the four companies gives users proper notice about changes in their apps’ terms and conditions. On top of that, the companies do not fully explain who they may share user records with, or for how long they retain that records\cite{Baker}.

\par

Therefore, health records should be encrypted if they are outsourced to the cloud, in order to preserve the privacy of the users. For example, in some dietary apps, users input their food habits and ask for recommendations. This is highly private information, as it reveals the lifestyle of the user. Leakage of such information can affect health insurance and sometimes employment, as has been reported over time~\cite{lindor2013preserving,gottlieb2001us}.

\par

As mentioned above, if the service providers encrypt and outsource this personal information, they need to execute computations over these encrypted records to provide their services. Nearest neighbour search over encrypted data is one vital functionality, as it finds similar users (or patients) who have similar attributes in their records. Such recommendations are a fundamental aspect of any social network.

%It is already adopted by  $31\%$ of health care organizations to mitigate the security concerns \cite{Zorabedian}.

%Still, cloud services should enable efficient search on the encrypted data to ensure the benefits of a full-fledged cloud computing environment.

%Social discovery based on similarity search is one of the most prevailing service facilitated by different social networking sites such as: Facebook, Twitter, Linked-In, Google+ etc. Moreover, virtual friendships are build upon social similarity providing its users the accord to explore new dimensions of empathy. The similarity among different users can be measured considering parameters like social relationships or graphs. Community detection and suggestions are also one of the feature of these social web services where user similarity or preferences are highly valued.

%usage of image in similarity

%Measuring similarity among millions of users in real time is a computationally expensive task. Specially when images are considered to be an useful feature to describe the preference of an individual. Photos uploaded by the users can actually be interpreted as the reflection of their interests, lifestyle and activities they get frequently engaged with~\cite{pandey2015capturing}.  On top of that, many social media applications are now directly considering the image contents as one of the major factors for discovering users who share common interest and make the decision for recommending a friend or group \cite{yuan2014enabling}.

%image-centric social discovery

%Eventually, image-centric social discovery requires evaluating and quantifying content similarity over a very large number of images shared by users \cite{yuantowards}. According to the user statistics published by Instagram in  May 2014, \textit{200 million} registered users uploaded \textit{20 billion photographs} and shared those to that date. Moreover, on top of that the amount of images are shared everyday in Instagram are close to \textit{60 million}~\cite{souza2014not}. This sheer volume of images itself demands massive amount of storage spaces for social media cites as back-end data center.

%As, the mobile devices tend to have a limited storage capacity, users often upload their images to the cloud servers such as Amazon Cloud Drive, Drop-box, Google Drive etc \cite{zhang2015pop}.

%cloud based image-centric social discovery

%Outsourcing these massive amount of visual information directly into the cloud can be considered as a potential solution for storage constraint problem. Some of the popular social media cites have already adopted this approach. For example, Instagram was using the Amazon cloud before Facebook acquired them in 2014 \cite{instagram}. Unfortunately, incorporating cloud storage for storing user’s uploaded images possesses severe security threat. One prime example which underlines this scenario is the recent incident regarding the hacking of the iCloud image storage service and celebrity photo leakage \cite{ lewis2014icloud}. Similarly there exists some other glaring examples which demonstrates the importance of the security threat for cloud based visual data storage.

%However, in social discovery applications to  search for other similar individuals an user has to enclose his personal preferences to service providers for building user profiles. After that, a nearest neighbour search operation is performed based on his/her user profile and results are recommended by the service providers. Apart from the sensitivity of images shared by the users, user profiles are also sensitive as it contains the personal preferences of the users. To minimize the disclosure risk, the data should be encrypted before outsourcing to the cloud. If we choose the data to be encrypted, we need to adopt a cryptographic scheme that allows to search over encrypted data.  

\subsection{Our Contributions}

The nearest neighbour search operation requires computing and comparing distances between the query point and the data points. To make the search secure, all these operations should be performed on the ciphertext using existing state-of-the-art cryptographic techniques such as homomorphic encryption, garbled circuits, or order-preserving encryption. However, fully homomorphic encryption based techniques are computationally expensive for real-life applications, while the use of garbled circuits requires huge interaction overheads between the client and the servers, which is inefficient for supporting a large number of queries in search applications \cite{wang2016practical}. Order-preserving encryption \cite{boldyreva2009order} can only evaluate comparisons, and additive homomorphic encryption can only be applied for addition. These operations alone are not sufficient for performing a nearest neighbour search on encrypted data.
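To illustrate why additive homomorphic encryption alone falls short, the following toy Paillier sketch (with deliberately tiny, insecure primes chosen purely for illustration) can add two ciphertexts, but exposes no way to compare them, which a nearest neighbour search also needs:

```python
import math
import random

# Toy Paillier keypair with tiny fixed primes -- illustration only,
# nowhere near a secure parameter size.
p, q = 17, 19
n = p * q                      # public modulus
n2 = n * n
lam = math.lcm(p - 1, q - 1)   # private key lambda
mu = pow(lam, -1, n)           # modular inverse of lambda (generator g = n + 1)

def encrypt(m):
    """Paillier encryption: c = (n+1)^m * r^n mod n^2, m < n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Paillier decryption: L(c^lambda mod n^2) * mu mod n, L(x) = (x-1)//n."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n

c1, c2 = encrypt(20), encrypt(22)
print(decrypt((c1 * c2) % n2))  # additive homomorphism: prints 42
```

Multiplying ciphertexts adds the underlying plaintexts, but nothing here lets the server decide which of two encrypted distances is smaller; that comparison gap is what motivates the SGX-based design.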

Therefore, instead of adopting existing solutions which rely on time-consuming cryptographic protocols, we have designed an efficient framework utilizing the advantages of the recent, trustworthy, and readily available computation architecture Intel$\textregistered$ SGX. The SGX architecture is divided into trusted and untrusted sections, enabling secure computation over sensitive data within protected enclaves called through the application. The contributions of this paper can be summarized as follows:

\begin{itemize}

\item The main contribution of this paper is to design a practical solution for nearest neighbour search over encrypted data. Our design integrates secure hardware primitives to achieve both security and efficiency in nearest neighbour search over encrypted data.

\item We utilized the Intel SGX architecture as the secure computation block, which solves several challenges. To the best of our knowledge, this is the first attempt of its kind in which SGX is applied to secure nearest neighbour search in a cloud architecture.%not directly addressed by the hardware like how the communication will take place between the application and the protected enclave.

 %\item We conducted extensive experiments to justify the utility of out proposed mechanisms. The experimental data sets contain 25K images from Flickr image data set (MirFlickr-25K \cite{huiskes2008mir}).

 \item We provide a rigorous accuracy and utility analysis of our proposed approach. The similarity is measured using different metrics such as cosine similarity, Jaccard similarity, Hamming distance, and Euclidean distance. A computation time comparison of these metrics is provided in Section-\ref{experimental_results}. As the user profiles are high dimensional, the similarity search operation is further optimized by combining Locality Sensitive Hashing (LSH) with the similarity evaluation metrics. All these operations are executed inside the protected enclave.

\end{itemize}

The rest of the paper is organized as follows: Section-\ref{system_overview} gives an overview of our proposed model. Section-\ref{problem_formulation} formally presents the research problem addressed in this paper. Section-\ref{proposed_model} describes the steps of our proposed method. Our performance evaluation and security analysis are presented in Section-\ref{experimental_results}. Section-\ref{related_works} discusses the related works. Finally, we summarize our work and conclude the paper in Section-\ref{conclusion}.

%\vspace{-.2cm}

\section{System Overview}

\label{system_overview}

Current social networking and health-care applications focus mainly on users' profile attributes to determine user similarity. The question is how the trusted service providers store or use this shared sensitive information. Unwise handling of this information may have privacy implications.

\subsection{Architecture and Entities}

Fig.~\ref{fig:architecture} presents an overview of our proposed system. Our proposed system has three main entities: \textit{users, service providers, and third-party cloud server}. The role of each entity is described below:

\begin{itemize}

\item \textit{Users:} \textit{Users} are the people who use the social networking and mobile sensor-based health applications.

\item \textit{Service Providers:} This party consists of the entities who offer services to the users, such as Facebook, Flickr, Fitbit, etc. In our model, they facilitate the communication between users and the third-party cloud server.

\item \textit{Third Party Cloud Server:} Service providers outsource their large volume of data to a third-party cloud server, e.g., the Amazon cloud. These third-party servers are semi-honest (also known as honest-but-curious).

\end{itemize}

\subsection{System Service Flow}

From the architecture (Fig.~\ref{fig:architecture}), the information flow in the proposed framework occurs in two steps: \textit{first}, storing the user's data (securely) in the third party cloud server and, \textit{second}, performing the nearest neighbour search operation on the encrypted dataset.

\subsubsection{Storing Data in the Third Party Cloud Server}

The steps involved in storing data in the third party cloud server are as follows:

    \begin{itemize}%

   \item When the users upload their contents, the contents are received by the \textit{Social Networking Service Provider}.

   \item After receiving the data from users, the service provider builds a user profile for each user.

   \item The service provider then generates a similarity search index using LSH buckets.

   \item Finally, the user profiles and the search index are encrypted and uploaded to the cloud.

    \end{itemize}

    

\subsubsection{Similarity Search  Request}

When a similarity search request is generated by a user, the following steps take place:

    \begin{itemize}%

   \item The user issues the request and sends it to the \textit{Social Networking Service Provider}.

   \item The \textit{Social Networking Service Provider} encrypts the query and forwards the secure query to the third party cloud server.

   \item Upon receiving the secure request, the third party cloud server performs a nearest neighbour search operation on the stored encrypted profiles. This search operation is done via LSH bucket index inside the protected enclave.

   \item The third party cloud server sends the encrypted search result to the \textit{Social Networking Service Provider}.

   \item Upon receiving the search result, the service provider decrypts it and returns the results to the user.

    \end{itemize}

To make the system more practical and user-friendly, the encryption is done by the trusted \textit{Social Networking Service Provider} rather than by the user, because client-side encryption incurs high latency and bandwidth costs for mobile users \cite{ferreira2015IEEE}.

\subsection{Threat Model}

In our proposed model, we assume that the attacker is unable to physically open and manipulate the SGX-based processors that reside in the cloud service provider's data centre. We consider primary attacks from the third party cloud server, which is assumed to be honest but curious about learning the contents of users' shared data and users' interests. We focus on preserving the privacy of users' shared data outsourced in the cloud and on providing a secure similar-user query service. The service provider and all the users in the system are assumed to be trustworthy. We do not consider other possible attacks, such as a malicious user attack, at this time.

%—–shad changes

\section{Problem Formulation}\label{problem_formulation}

%\section{Problem Formulation} \label{problem_formulation}

Suppose a hospital or care-giving facility wants to outsource its patient data, such as age, height, weight, temperature, medical history, etc., to the cloud and provide similarity search for doctors or other research organizations. The objective of this paper is to enable the cloud to perform a nearest neighbour search based on well-established metrics in a privacy-preserving manner.

%\section{Background}

%\textbf{Dataset.}

\subsection{Dataset}

%need to change this

In this work, we have run the experiment using a dataset containing 25K images from the Flickr image dataset (MirFlickr-25K \cite{huiskes2008mir}). We have used a convolutional neural network \cite{krizhevsky2012imagenet} to generate binary-valued user profiles. A sample dataset representation is shown in Table-\ref{table:1}, where each row represents a user and each column except the first one represents a visual object class. If $c_{i,j} = 1$, then the visual object in the $j$-th column is associated with the preference of the user in the $i$-th row.

%\vspace{-1.5em}

\begin{table}[h!]

\centering

\caption{Sample dataset with binary feature attributes}\label{table:1}

\begin{tabular}{|c|c|c|c|c|c|}
 \hline
 User & $f_1$ & $f_2$ & $f_3$ & $f_4$ & $f_5$ \\
 \hline\hline
 $u_1$ & 0 & 1 & 0 & 1 & 1 \\
 $u_2$ & 0 & 1 & 0 & 1 & 1 \\
 $u_3$ & 1 & 0 & 0 & 1 & 1 \\
 $u_4$ & 1 & 0 & 1 & 1 & 1 \\
 $u_5$ & 0 & 1 & 0 & 1 & 0 \\
 $u_6$ & 1 & 0 & 1 & 1 & 1 \\
 \hline
\end{tabular}

\end{table}

\setlength{\textfloatsep}{5pt}

%\vspace{-1.5em}

\textbf{Convolutional Neural Network.}

A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers followed by one or more fully connected layers, as in a standard multilayer neural network. Every network layer acts as a detection filter for the presence of specific features or patterns in the original data. A sample of user profiles consisting of a binary $user \times feature$ matrix is shown in Table-\ref{table:1}. The service provider extracts visual features (e.g., dog, flowers, portraits) from the images uploaded by users using a CNN and then generates feature vectors. All the feature vectors are then combined and normalised to generate the user profile.
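This profile-generation step can be sketched as follows. The CNN itself is not reproduced here; the class names, per-image scores, and the 0.5 threshold are illustrative assumptions, not the paper's exact pipeline:

```python
# Hypothetical sketch: per-image CNN class scores are averaged and
# thresholded into one binary user profile. CLASSES and THRESHOLD
# are illustrative assumptions.
CLASSES = ["dog", "flower", "portrait", "food", "landscape"]
THRESHOLD = 0.5

def build_profile(image_scores):
    """image_scores: one confidence vector per uploaded image,
    aligned with CLASSES. Returns a binary profile row as in Table 1."""
    count = len(image_scores)
    avg = [sum(img[i] for img in image_scores) / count
           for i in range(len(CLASSES))]
    return [1 if s >= THRESHOLD else 0 for s in avg]

# Two hypothetical images from one user: mostly "dog" and "food".
scores = [[0.9, 0.1, 0.2, 0.8, 0.4],
          [0.7, 0.0, 0.3, 0.9, 0.2]]
print(build_profile(scores))  # → [1, 0, 0, 1, 0]
```

Each user thus contributes one binary row, matching the $user \times feature$ matrix format of Table-\ref{table:1}.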

\subsection{Privacy Requirement}

 The diversity and massive size of user-shared content and inter-user content interactions have raised new concerns for social discovery and other online applications, among which privacy preservation of user-preferences is a major challenge \cite{li2011privacy}. In this work, we preserve the privacy of user’s shared contents by encryption before outsourcing the data into the cloud server. The required computation regarding nearest neighbour search such as similarity measurement among user profiles is performed over encrypted data using Intel SGX architecture.

\textbf{Intel SGX.}

SGX is a set of x86-64 ISA extensions that enables execution of code in a protected environment, termed an enclave, without trusting anything other than the processor and the code placed inside the enclave \cite{mckeen2013innovative}.

\textbf{Memory Protection:} An enclave can be visualised as a protected container residing in the application's address space. The processor protects the enclave and controls access to enclave memory. Any instruction trying to read or write the memory of a running enclave from outside the enclave will fail \cite{costanintel}. Furthermore, SGX protects the privacy and integrity of the pages in an enclave. Cache-resident enclave data is protected by CPU access controls, and data integrity is protected by encryption when written to memory. If the data in memory is modified, a subsequent load will signal a fault \cite{baumann2015shielding}.

\textbf{Enclave Access:} In addition to protecting the content and integrity of the memory mappings of an enclave, SGX also monitors the transfer of function calls into and out of the enclave. In this way, it protects the enclave's register file from OS exception handlers \cite{baumann2015shielding}. A function written inside the enclave can be called from untrusted code through a process that transfers the call to a user-defined entry point inside the enclave \cite{schuster2015vc3}. Enclave execution may be interrupted due to system calls. In such cases, the processor saves the current state of the registers to enclave memory and resumes execution outside of the enclave.

\textbf{Sealing and Attestation:} SGX also supports sealed storage and attestation, enabling a remote system to verify securely that specific software has been loaded within an enclave \cite{mckeen2013innovative}. It can establish shared secrets, allowing it to bootstrap an end-to-end encrypted channel with the enclave. During enclave creation, a secure hash known as a measurement is established from the enclave's initial state. Afterwards, the enclave may retrieve a report signed by the processor that proves its identity and communicates using a unique value (such as a public key) with another local enclave \cite{baumann2015shielding}. Ultimately, the processor manufacturer (e.g., Intel) is the root of trust for attestation.

\subsection{User Similarity Measurement}\label{similarity}

The similarity of user profiles is computed inside the SGX enclave in a secure manner. If we consider the two user profile vectors $u_1$ as A and $u_3$ as B from Table-\ref{table:1} and $n$ as the total size of the feature set, then the similarity between the two user profiles ($A, B$) can be measured using the different metrics defined below:

\textit{Cosine Similarity:} Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. The cosine similarity between two vectors A and B can be calculated as:

\begin{equation}
Sim(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2 \cdot \sum_{i=1}^{n} B_i^2}}
\end{equation}
From Table-\ref{table:1}, the cosine similarity between A and B is:
\[
Sim(A,B) = \frac{0 \cdot 1 + 1 \cdot 0 + 0 \cdot 0 + 1 \cdot 1 + 1 \cdot 1}{\sqrt{(0^2+1^2+0^2+1^2+1^2)(1^2+0^2+0^2+1^2+1^2)}} = \frac{2}{\sqrt{3 \cdot 3}} = \frac{2}{3}
\]

%\vspace{-1cm}

\textit{Jaccard Similarity:} The Jaccard similarity between two sets is the size of their intersection divided by the size of their union. The Jaccard similarity between the vectors A and B can be calculated as,
\[
Sim(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{2}{4}
\]

\textit{Hamming Distance:} The Hamming distance between two vectors of equal length is the number of positions at which the corresponding elements differ. The Hamming distance between the vectors A and B can be calculated as,
\begin{equation}
d(A,B) = \sum_{i=1}^{n} |A_i - B_i| = 2
\end{equation}

\textit{Euclidean Distance:} Euclidean distance is the straight-line distance between two points in Euclidean space. The Euclidean distance between the vectors A and B can be calculated as,
\begin{equation}
d(A,B) = \sqrt{\sum_{i=1}^{n}(A_i-B_i)^2} = \sqrt{(0-1)^2+(1-0)^2+(0-0)^2+(1-1)^2+(1-1)^2} = \sqrt{2}
\end{equation}
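The worked example above (vectors $u_1$ and $u_3$ from Table-\ref{table:1}) can be cross-checked with a short plain-Python script; the metric definitions are standard, only the variable names are illustrative:

```python
import math

A = [0, 1, 0, 1, 1]  # user u1 from Table 1
B = [1, 0, 0, 1, 1]  # user u3 from Table 1

def cosine(a, b):
    """Dot product over the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def jaccard(a, b):
    """Size of the intersection over the size of the union (binary vectors)."""
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union

def hamming(a, b):
    """Number of positions where the elements differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def euclidean(a, b):
    """Straight-line distance in Euclidean space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# cosine = 2/3, jaccard = 2/4, hamming = 2, euclidean = sqrt(2)
print(cosine(A, B), jaccard(A, B), hamming(A, B), euclidean(A, B))
```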

\subsection{Utility Requirement}

Finding the nearest neighbours for a target user by measuring similarity against all other users in a brute-force way is very time-consuming and impractical for large datasets with many users. Hence, we have to adopt a scheme that reduces the number of candidates qualified for similarity measurement during a nearest neighbour search. Preserving the utility of the search result along with a reasonable computation time is desirable.

\section{Proposed Approach}\label{proposed_model}

The approach begins with users sharing content with social networking and mobile application service providers in order to find matches with other users who have common interests. After sharing content, a user may ask the application service provider to find the top-$k$ most similar users.

\subsection{Preprocessing Steps at Service Provider End}

%\textbf{Prepossessing Steps at Service Provider End:}

To enable the cloud to remotely perform a secure nearest neighbour search for the similar-user query service, the service provider takes some preprocessing steps.

\textbf{LSH Bucket Construction.}

During similarity search, objects are characterised by a collection of relevant features and are represented as vectors or points in a high-dimensional space. Given a collection of vectors, similarity can be measured by computing the distance or closeness among the vectors, for example with Euclidean distance, Jaccard similarity, or cosine similarity, as listed in Section-\ref{problem_formulation}. Performing pairwise comparisons in a set is time-consuming because the number of comparisons grows quadratically with the size of the set. Most of those comparisons, furthermore, are unnecessary because they do not result in matches. The combination of minhashing and LSH seeks to solve these problems.

Minhashing is the process of converting large sets into short signatures while preserving similarity. It makes it possible to process each element only once, so the cost of computation grows linearly rather than quadratically. Computing minhash signatures in principle requires random permutations of the rows. However, a random permutation can be simulated by a random hash function that maps rows to the same number of buckets; we assume that our hash function $h$ permutes row $r$ to position $h(r)$ in the permuted order. Let $SIG(i, c)$ be the element of the signature matrix for the $i$-th hash function and column $c$. Initially, we set $SIG(i, c)$ to $\infty$ for all $i$ and $c$. We denote the element of the profile matrix in row $r$ and column $c$ as $c_{r,c}$. Minhash signatures for each user entity $u$ are computed following the steps of Algorithm-\ref{algorithm:1}.

\begin{algorithm}[t]

\begin{algorithmic}[1]

 \caption{Computing Minhash Signatures}\label{algorithm:1}

 \renewcommand{\algorithmicrequire}{\textbf{Input:}}

 \renewcommand{\algorithmicensure}{\textbf{Output:}}

 \REQUIRE User profile matrix

 \ENSURE  Minhash signatures

  \STATE Set $SIG(i, c) \leftarrow \infty$ for all $i$ and $c$

  \FOR{\textbf{each} row $r$}

  \STATE Compute $h_1(r), h_2(r), \ldots, h_n(r)$

  \FOR{\textbf{each} column $c$}

  \IF { $c_{r,c} = 1 $}

  \FOR{$i=1$ to $n$}

\STATE Set $SIG(i, c)$ to the smaller of the current value of $SIG(i, c)$ and $h_i(r)$

\ENDFOR

\ENDIF

\ENDFOR

\ENDFOR

 \RETURN  $SIG(i, c)$

 \end{algorithmic}

 \end{algorithm}

  Suppose each column in Table-\ref{table:2} represents a user profile. We have chosen two hash functions: $h_1(x) = (x+1) \bmod 5$ and $h_2(x) = (3x+1) \bmod 5$. The values of these two functions applied to the row numbers are given in the last two columns of Table-\ref{table:2}.

%  \vskip-1.5ex

 %\setlength{\textfloatsep}{5pt}

 \begin{table}[h!]

\centering

%\captionsetup[table]{aboveskip=0pt}

%\captionsetup[table]{belowskip=10pt}

\caption{Hash functions computed for the user profile matrix}\label{table:2}

\begin{tabular}{ |c|c|c|c|c|c|c|c| }

 \hline

  Row & $u_1$ & $u_2$ & $u_3$ & $u_4$ & $u_5$ & $(x+1) \bmod 5$ & $(3x+1) \bmod 5$ \\

 \hline\hline

 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1\\

  1 & 1 & 1 & 0 & 0 & 1 & 2 & 4\\

  2 & 0 & 0 & 0 & 1 & 0 & 3 & 2\\

  3 & 1 & 1 & 1 & 1 & 1 & 4 & 0\\

  4 & 0 & 0 & 0 & 0 & 0 & 0 & 3\\

 \hline

\end{tabular}

\end{table}

\setlength{\textfloatsep}{5pt}

%\vskip-2ex

 Now, let us simulate the algorithm to compute the signature matrix. Initially, the signature matrix consists of all $\infty$, as shown in Table-\ref{table:3}.

%\vskip-2ex

%\setlength{\textfloatsep}{5pt}

 \begin{table}[h!]

\centering

\caption{Computing minhash signatures (Step-1)}\label{table:3}

\begin{tabular}{ |c|c|c|c|c|c| }

 \hline

  & $u_1$ & $u_2$ & $u_3$ & $u_4$ & $u_5$ \\

 \hline\hline

 $h_1$ & $\infty$ & $\infty$ & $\infty$ & $\infty$ & $\infty$ \\

 $h_2$ & $\infty$ & $\infty$ & $\infty$ & $\infty$ & $\infty$ \\

 \hline

\end{tabular}

\end{table}

\setlength{\textfloatsep}{5pt}

\vskip-1.5ex

At first, we consider row 0 of Table-\ref{table:2}. We see that the values of $h_1(0)$ and $h_2(0)$ are both 1. The row numbered 0 has 1's in the columns for users $u_3$ and $u_4$, so only these columns of the signature matrix can change. As 1 is less than $\infty$, we do in fact change both values in the columns for $u_3$ and $u_4$. The current estimate of the signature matrix is shown in Table-\ref{table:4}.

\vskip-2ex

\begin{table}[h!]

\centering

\caption{Computing minhash signatures (Step-2)}\label{table:4}

\begin{tabular}{ |c|c|c|c|c|c| }

 \hline

  & $u_1$ & $u_2$ & $u_3$ & $u_4$ & $u_5$ \\

 \hline\hline

 $h_1$ & $\infty$ & $\infty$ & 1 & 1 & $\infty$ \\

 $h_2$ & $\infty$ & $\infty$ & 1 & 1 & $\infty$ \\

 \hline

\end{tabular}

\end{table}

\setlength{\textfloatsep}{5pt}

%\vskip-1.5ex

Now, we operate on the row numbered 1 in Table-\ref{table:2}. This row has 1's in $u_1$, $u_2$ and $u_5$, and its hash values are $h_1(1) = 2$ and $h_2(1) = 4$. The new signature matrix is shown in Table-\ref{table:5}.

\vskip-2ex

\begin{table}[h!]

\centering

\caption{Computing minhash signatures (Step-3)}\label{table:5}

\begin{tabular}{ |c|c|c|c|c|c| }

 \hline

  & $u_1$ & $u_2$ & $u_3$ & $u_4$ & $u_5$ \\

 \hline\hline

 $h_1$ & 2 & 2 & 1 & 1 & 2 \\

 $h_2$ & 4 & 4 & 1 & 1 & 4 \\

 \hline

\end{tabular}

\end{table}

\setlength{\textfloatsep}{5pt}

%\vskip-1.5ex

If we continue in this way, we finally obtain the signature matrix shown in Table-\ref{table:6}. It is clear from the table that the signature columns of $u_1$, $u_2$ and $u_5$ are identical, as are those of $u_3$ and $u_4$, so their original user profiles should be similar. We can estimate the similarity using Jaccard similarity.

\vskip-2ex

\begin{table}[h!]

\centering

\caption{Computing minhash signatures (Final Step)}\label{table:6}

\begin{tabular}{ |c|c|c|c|c|c| }

 \hline

  & $u_1$ & $u_2$ & $u_3$ & $u_4$ & $u_5$ \\

 \hline\hline

 $h_1$ & 2 & 2 & 1 & 1 & 2 \\

 $h_2$ & 0 & 0 & 0 & 0 & 0 \\

 \hline

\end{tabular}

\end{table}

\setlength{\textfloatsep}{5pt}
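The step-by-step simulation above can be reproduced with a short script. This is a minimal sketch: the profile matrix and the two hash functions are taken from Table-\ref{table:2}, and the function names are ours.

```python
INF = float("inf")

# User profile matrix from the worked example: rows are features,
# columns are the users u1..u5
profiles = [
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]

# The two hash functions from the example: (x+1) mod 5 and (3x+1) mod 5
hash_funcs = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]

def minhash_signatures(matrix, hashes):
    n_users = len(matrix[0])
    # Initially every signature entry is infinity
    sig = [[INF] * n_users for _ in hashes]
    for r, row in enumerate(matrix):
        hvals = [h(r) for h in hashes]        # h_1(r), ..., h_n(r)
        for c, bit in enumerate(row):
            if bit == 1:                      # column c has a 1 in row r
                for i, hv in enumerate(hvals):
                    sig[i][c] = min(sig[i][c], hv)
    return sig

print(minhash_signatures(profiles, hash_funcs))
# [[2, 2, 1, 1, 2], [0, 0, 0, 0, 0]] — the final signature matrix above
```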

But computing minhash signatures and comparing them linearly is still time-consuming for big datasets, so we combine LSH with minhashing. LSH is an efficient algorithm for fast nearest neighbour search in high-dimensional spaces \cite{andoni2006near}. One general approach to LSH is to hash items several times so that similar items are more likely to be hashed to the same bucket than dissimilar items. The hashing is done in several stages, and if the hash signature of a user profile matches the hash signature of the target user profile in any stage, the two profiles are considered a candidate pair. The method works in such a way that most dissimilar pairs never hash to the same bucket, and therefore are never checked \cite{rajaraman2010finding}. At the end of this phase, $n$ buckets are generated, each of length $m$, where $n$ is the number of users in the dataset and $m$ is the maximum number of candidates.
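The staged hashing can be sketched with the standard banding technique of \cite{rajaraman2010finding}: the signature is split into bands, each band is hashed into its own bucket space, and users colliding in any band become candidates. This is a minimal illustration, not our SGX implementation; the helper \texttt{lsh\_candidates} and the choice of two bands of one row each are assumptions for this example.

```python
from collections import defaultdict

def lsh_candidates(signatures, bands, rows_per_band):
    """Group users whose minhash signatures agree in at least one band."""
    buckets = defaultdict(set)
    for user, sig in signatures.items():
        for b in range(bands):
            # Each band is used as a bucket key; the band index keeps
            # the stages in separate bucket spaces
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[(b, band)].add(user)
    candidates = defaultdict(set)
    for members in buckets.values():
        for u in members:
            candidates[u] |= members - {u}
    return candidates

# Final minhash signatures from the worked example, two bands of one row each
sigs = {"u1": [2, 0], "u2": [2, 0], "u3": [1, 0], "u4": [1, 0], "u5": [2, 0]}
cands = lsh_candidates(sigs, bands=2, rows_per_band=1)
print(sorted(cands["u1"]))
# ['u2', 'u3', 'u4', 'u5'] — every user collides in the all-zero h2 band
```

With such short signatures the second band makes everyone a candidate of everyone; with realistic signature lengths the bands separate dissimilar profiles.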

\noindent\textbf{Encryption Phase.}

After bucket generation, the buckets and user profiles are encrypted and uploaded to the cloud. We use symmetric encryption via AES-128 in CBC mode. In CBC mode, all blocks are chained together and encryption is randomised by an initialization vector (IV) \cite{paar2010more}; the inclusion of the IV makes CBC-mode encryption non-deterministic. We make sure that all buckets have the same length $m$, where $m$ is the size of the largest candidate set. Any duplicate elements in the buckets are removed, so the same candidate is never checked twice during similarity measurement. If the number of candidates in a bucket is less than $m$, the empty positions are filled with random values within the range of the number of users $n$, which makes the buckets indistinguishable. Algorithm-\ref{algorithm:2} presents the main steps of the encryption phase.

\begin{algorithm}

\begin{algorithmic}[1]

 \caption{Encryption Phase}\label{algorithm:2}

 \renewcommand{\algorithmicrequire}{\textbf{Input:}}

 \renewcommand{\algorithmicensure}{\textbf{Output:}}

 \REQUIRE

 

 $u=\{u_1,u_2,\ldots,u_n\}$: user profile set,\\

 $k$: secret key,\\

 $iv$: initialization vector,\\

 $m$: maximum bucket size

 \ENSURE  $U$, $B$: encrypted user profiles and buckets

% \\ \textit{Initialisation} :

\FOR{\textbf{each} $user_i$}

  \STATE Build user profile ($u_i$)

   \STATE Compute buckets ($b_i$) consisting of candidate pairs for user ($u_i$) using Minhashing and LSH

   \IF {bucket size \textless $m$}

   \STATE add random padding $r$ to set the size $m$

    \ENDIF

    \STATE $B_i$=BucketEnc($b_i$,$k$,$iv$)

    \STATE $U_i$=UserEnc($u_i$,$k$,$iv$)

   \STATE Upload $U_i$ and $B_i$ to the cloud

   \ENDFOR

    \RETURN $U$, $B$

 \end{algorithmic}

 \end{algorithm}
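The bucket normalisation steps of the encryption phase (deduplication and random padding) can be sketched as follows. This is a minimal sketch with a hypothetical helper \texttt{prepare\_bucket}; the actual AES-128-CBC encryption of the resulting bucket is omitted.

```python
import random

def prepare_bucket(candidates, m, n_users, rng=random.Random(42)):
    # Remove duplicates (preserving order) so the same candidate is
    # never checked twice during similarity measurement
    bucket = list(dict.fromkeys(candidates))
    # Pad with random user ids in [0, n_users) so every bucket has
    # length m, making the encrypted buckets indistinguishable by size
    while len(bucket) < m:
        bucket.append(rng.randrange(n_users))
    return bucket

bucket = prepare_bucket([3, 7, 3, 9], m=6, n_users=100)
print(len(bucket), bucket[:3])  # 6 [3, 7, 9]
```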

\subsection{Processing Steps at Cloud Server End}

%\textbf{Processing Steps at Cloud Server End:}

To enable the cloud to remotely perform the secure nearest neighbour search for the similar-user query service, some processing steps are executed by the cloud server. Algorithm-\ref{algorithm:3} presents the main steps of the secure nearest neighbour search at the cloud server end. All operations for computing the nearest neighbours of a query user profile are executed inside the SGX enclave.

\begin{algorithm}

\begin{algorithmic}[1]

 \caption{Secure Nearest Neighbour Search}\label{algorithm:3}

 \renewcommand{\algorithmicrequire}{\textbf{Input:}}

 \renewcommand{\algorithmicensure}{\textbf{Output:}}

 \REQUIRE

  encrypted query (q), encrypted buckets (B)

 \ENSURE encrypted search result (r)

% \\ \textit{Initialisation} :

\STATE Decrypt query user profile ($u_q$)

\STATE Decrypt corresponding bucket ($b_q$)

\FOR{\textbf{each} candidate $user_i$ in $b_q$}

\STATE $d$=ComputeDistance($u_i$,$u_q$)

\STATE $r$=RankResult($u_i$,$d$)

   \ENDFOR

 

    \RETURN $r$

 \end{algorithmic}

 \end{algorithm}

\noindent\textbf{Decryption Phase.} When a user sends a similar-user query to the service provider, the query is encrypted and forwarded to the cloud server. According to the query, the buckets are decrypted inside the enclave, and the bucket that matches the query is qualified for similarity search.

\noindent\textbf{Similarity Computation.} Only the elements of the matching bucket are checked during nearest neighbour computation. The candidate user profiles are decrypted, and similarity is measured using the metrics defined in Section-\ref{similarity}. The results are then ranked according to the distance between the candidates and the query user profile. The query retrieves the top similarity matches and returns the encrypted matching results to the service provider.
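The ranking step over a decrypted candidate bucket can be sketched as follows. This is a plain-text sketch of the in-enclave computation: \texttt{rank\_candidates} is a hypothetical helper, and Euclidean distance stands in for whichever metric of Section-\ref{similarity} is chosen.

```python
import heapq
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_candidates(query, candidates, k):
    # candidates: user id -> decrypted candidate profile vector.
    # A smaller distance means a nearer neighbour, so take the k smallest.
    return heapq.nsmallest(
        k, ((euclidean(p, query), uid) for uid, p in candidates.items())
    )

query = [0, 1, 0, 1, 1]
cands = {1: [1, 0, 0, 1, 1], 2: [0, 1, 0, 1, 0], 3: [1, 1, 1, 0, 0]}
top2 = rank_candidates(query, cands, k=2)
print([uid for _, uid in top2])  # [2, 1]
```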

  After receiving the encrypted nearest neighbour search results, the service provider processes them and returns the results to the target user. The ranking results can be further utilised by integrating other efficient social discovery approaches that combine different features, making the application framework more acceptable and accurate. For example, one can combine our proposed approach with \cite{ahmed2015cohesion}, \cite{chu2013friend}, \cite{raiupeople} to further improve ranking quality and accuracy. The final results returned to the user then combine all these features in addition to the user's interests, and the accuracy varies with the features combined with the user's interests.

%\textit{Nearest Neighbour Search over Anonymized Data:}

%In recent years, though cloud based solution to storage constraints offers many advantages, it has some privacy and security concerns. To reduce these concerns, we are outsourcing the content rich sensitive images after encryption and user profiles after anonymization to the cloud. The similarity search query is executed over anonymized data. Although anonymization of data loose some utility, it prevents security breach.  The similarity search over anonymized data is performed utilizing fast nearest neighbour search in high dimensional spaces with combination of LSH and Jaccard Similarity.

%\section{Security Analysis}\label{security_analysis}

\section{Experimental Evaluation}

\label{experimental_results}

In this section, we present the experiments we carried out to evaluate our solution.

\subsection{Implementation Details}

The implementation of the prototype of the framework as described in Fig. \ref{fig:architecture} was done in two phases. In the first phase, we used a convolutional neural network to extract features from the images. For this purpose, we used a PC with an Intel Core i5-4590 CPU and 8 GB of RAM. For training the network, we used a MATLAB toolbox called MatConvNet \cite{vedaldi15matconvnet}. The MatConvNet toolbox implements \textit{CNNs}, which are very effective for extracting visual features from images. We used a pre-trained ImageNet model available in the toolbox. After training our model, we extracted the image features using the trained model. The extracted features were used to generate binary-valued user profile vectors. For the construction of LSH buckets, we used a Java library called tdebatty/java-LSH \cite{tdebatty_Lsh}, which efficiently implements Locality Sensitive Hashing (LSH) with minhashing as described in \cite{rajaraman2010finding}.

In the second phase, we evaluated the performance of nearest neighbour search using the SGX architecture. For the implementation of AES-128, we used the library kokke/tiny-AES128-C \cite{aes_128}. The experiments were run on an Intel Core i7-6700 processor with 8 GB of RAM. The code was compiled with the Microsoft Visual Studio C++ 2012 compiler. We measured the computation time of finding nearest neighbours inside the protected SGX enclave both with and without the LSH technique.

\subsection{Nearest Neighbour Search Performance Evaluation}

We simulated nearest neighbour search operation using our model in two ways.

\textit{First}, we created a variable number of binary-valued user profiles by randomly assigning a binary value to the corresponding image feature from the dataset, representing users' visual preferences. Then, for some target users, we found the top-$10$ nearest user profiles. In this first phase, similarity computation and search are performed over encrypted data inside the enclave with the metrics defined in Section-\ref{similarity}; LSH is not applied. The result is shown in Figure-\ref{fig:accuracy1}.
