MINDS @ UW-Madison

To Join or Not to Join? Thinking Twice about Joins before Feature Selection

File(s):

OptFSSIGMODTR.pdf (4.297 MB, application/pdf)

Key Value Language
dc.contributor.author Kumar, Arun
dc.contributor.author Naughton, Jeffrey
dc.contributor.author Patel, Jignesh M.
dc.contributor.author Zhu, Xiaojin
dc.date.accessioned 2015-12-03T22:43:01Z
dc.date.available 2015-12-03T22:43:01Z
dc.date.issued 2015-11-27
dc.identifier.citation TR1828 en
dc.identifier.uri http://digital.library.wisc.edu/1793/73836
dc.description.abstract Closer integration of machine learning (ML) with data processing is a booming area in both the data management industry and academia. Almost all ML toolkits assume that the input is a single table, but many datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins to obtain features from all base tables and apply a feature selection method, either explicitly or implicitly, with the aim of improving accuracy. In this work, we show that the features brought in by such joins can often be ignored without affecting ML accuracy significantly, i.e., we can "avoid joins safely". We identify the core technical issue that could cause accuracy to decrease in some cases and analyze this issue theoretically. Using simulations, we validate our analysis and measure the effects of various properties of normalized data on accuracy. We apply our analysis to design easy-to-understand decision rules to predict when it is safe to avoid joins in order to help analysts exploit this runtime-accuracy tradeoff. Experiments with multiple real normalized datasets show that our rules accurately predict when joins can be avoided safely and that, in some cases, avoiding them significantly reduces the runtime of some popular feature selection methods. en
dc.description.provenance Submitted by Jody Hoesly (jhoesly@wisc.edu) on 2015-12-03T22:43:01Z No. of bitstreams: 1 OptFSSIGMODTR.pdf: 4296824 bytes, checksum: cc092ae74c51dc79031c2922b2679b7c (MD5) en
dc.description.provenance Made available in DSpace on 2015-12-03T22:43:01Z (GMT). No. of bitstreams: 1 OptFSSIGMODTR.pdf: 4296824 bytes, checksum: cc092ae74c51dc79031c2922b2679b7c (MD5) Previous issue date: 2015-11-27 en
dc.subject Advanced analytics en
dc.subject joins en
dc.subject machine learning en
dc.subject feature selection en
dc.subject feature engineering en
dc.title To Join or Not to Join? Thinking Twice about Joins before Feature Selection en
dc.type Technical Report en
