Document Recovery from Bag-of-Word Indices
dc.contributor.author | Fillmore, Nathanael | en_US |
dc.contributor.author | Goldberg, Andrew B. | en_US |
dc.contributor.author | Zhu, Xiaojin | en_US |
dc.date.accessioned | 2012-03-15T17:23:48Z | |
dc.date.available | 2012-03-15T17:23:48Z | |
dc.date.created | 2008 | en_US |
dc.date.issued | 2008 | en_US |
dc.identifier.citation | TR1645 | en_US |
dc.identifier.uri | http://digital.library.wisc.edu/1793/60654 | |
dc.description.abstract | Motivated by computer privacy issues, we present the novel problem of document recovery from an index: given only a document's bag-of-words (BOW) vector or another type of index, reconstruct the original ordered document. We investigate a variety of index types, including count-based BOW vectors, stopwords-removed count BOW vectors, indicator BOW vectors, and bigram count vectors. We formulate the problem as hypothesis rescoring using A* search with the Google Web 1T 5-gram corpus. Our experiments on five domains indicate that when the original documents are short, they can be recovered with high accuracy. | en_US |
dc.format.mimetype | application/pdf | en_US |
dc.publisher | University of Wisconsin-Madison Department of Computer Sciences | en_US |
dc.title | Document Recovery from Bag-of-Word Indices | en_US |
dc.type | Technical Report | en_US |
This item appears in the following Collection(s)
- CS Technical Reports
Technical Reports Archive for the Department of Computer Sciences at the University of Wisconsin-Madison