Preventing Information Leakage in the Search Engine
This thesis covers the design, implementation, and evaluation of a search engine which can give each user a customized index based on the documents they are authorized to view. A common solution available today for this situation is to filter the results of a query based on the list of documents a user has access to. In this scenario, it is possible for information to leak from the search engine because the filtering takes place after the results are ranked. Ranking algorithms are usually based on information which considers characteristics of the entire corpus when calculating the score a document will receive. This type of information in the index must be cleaned before it is used to judge the relevant documents for a user query, otherwise data leakage is possible. Cleaning this information at query time might have a dramatic effect on query performance which would discourage use. The work presented here takes this sensitive information and calculates it for every user authorized to view the documents at index time. At query time, the search engine uses a filtered global index for selecting relevant results, but ranks the results using the information stored for the individual user’s authorized view of the index. Different designs are compared, but the same concept is present in all implementations. An open source index, Apache’s Lucene, was used as a starting point for this work. All modifications were made to Lucene and then compared to an unmodified Lucene for performance evaluations. The findings are that it is feasible to include access control in a basic search engine without incurring dramatic loss in performance.
PublisherUniversitetet i Tromsø
University of Tromsø
The following license file are associated with this item: