Document Recovery from Bag-of-Word Indices
- Author(s)
- Fillmore, Nathanael; Goldberg, Andrew B.; Zhu, Xiaojin
- Publisher
- University of Wisconsin-Madison Department of Computer Sciences
- Date
- Mar 15, 2012
- Abstract
- Motivated by computer privacy issues, we present the novel problem of document recovery from an index: given only a document's bag-of-words (BOW) vector or other type of index, reconstruct the original ordered document. We investigate a variety of index types, including count-based BOW vectors, stopwords-removed count BOW vectors, indicator BOW vectors, and bigram count vectors. We formulate the problem as hypothesis rescoring using A* search with the Google Web 1T 5-gram corpus. Our experiments on five domains indicate that if original documents are short, the documents can be recovered with high accuracy.
- Permanent link
- http://digital.library.wisc.edu/1793/60654
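
The four index types named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the whitespace tokenizer, the tiny stopword list, and all function names are assumptions made for illustration only. It shows what each index keeps and why word order has to be recovered by search.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only; the list used in the
# report is not given in this record.
STOPWORDS = {"the", "a", "an", "of", "on", "and", "to", "in"}


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def count_bow(tokens):
    """Count-based BOW: each word type mapped to its frequency."""
    return Counter(tokens)


def stopword_removed_bow(tokens, stopwords=STOPWORDS):
    """Count-based BOW computed after dropping stopwords."""
    return Counter(t for t in tokens if t not in stopwords)


def indicator_bow(tokens):
    """Indicator BOW: only which word types occur, without counts."""
    return set(tokens)


def bigram_counts(tokens):
    """Bigram count vector: frequency of each adjacent word pair."""
    return Counter(zip(tokens, tokens[1:]))


tokens = tokenize("the cat sat on the mat")
shuffled = tokenize("mat the sat cat on the")

# Two different orderings share the same count BOW, so the index alone
# does not determine the original document.
same_index = count_bow(tokens) == count_bow(shuffled)
```

Here `same_index` is true because both orderings contain the same words with the same counts. Bigram counts retain some local order, which is one reason the different index types may differ in how hard recovery is.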