Show simple item record

dc.contributor.author: Fillmore, Nathanael [en_US]
dc.contributor.author: Goldberg, Andrew B. [en_US]
dc.contributor.author: Zhu, Xiaojin [en_US]
dc.date.accessioned: 2012-03-15T17:23:48Z
dc.date.available: 2012-03-15T17:23:48Z
dc.date.created: 2008 [en_US]
dc.date.issued: 2008 [en_US]
dc.identifier.citation: TR1645 [en_US]
dc.identifier.uri: http://digital.library.wisc.edu/1793/60654
dc.description.abstract: Motivated by computer privacy issues, we present the novel problem of document recovery from an index: given only a document's bag-of-words (BOW) vector or other type of index, reconstruct the original ordered document. We investigate a variety of index types, including count-based BOW vectors, stopwords-removed count BOW vectors, indicator BOW vectors, and bigram count vectors. We formulate the problem as hypothesis rescoring using A* search with the Google Web 1T 5-gram corpus. Our experiments on five domains indicate that if original documents are short, the documents can be recovered with high accuracy. [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.publisher: University of Wisconsin-Madison Department of Computer Sciences [en_US]
dc.title: Document Recovery from Bag-of-Word Indices [en_US]
dc.type: Technical Report [en_US]


This item appears in the following Collection(s)

  • CS Technical Reports
    Technical Reports Archive for the Department of Computer Sciences at the University of Wisconsin-Madison
