Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices

dc.contributor.author: Hebbar, Rajat
dc.contributor.author: Papadopoulos, Pavlos
dc.contributor.author: Reyes, Ramon
dc.contributor.author: Danvers, Alexander F.
dc.contributor.author: Polsinelli, Angelina J.
dc.contributor.author: Moseley, Suzanne A.
dc.contributor.author: Sbarra, David A.
dc.contributor.author: Mehl, Matthias R.
dc.contributor.author: Narayanan, Shrikanth
dc.contributor.department: Neurology, School of Medicine
dc.date.accessioned: 2022-05-18T16:13:04Z
dc.date.available: 2022-05-18T16:13:04Z
dc.date.issued: 2021
dc.description.abstract: In recent years, machine learning techniques have produced state-of-the-art results in several audio-related tasks. The success of these approaches has been largely due to access to large open-source datasets and increased computational resources. However, these methods often fail to generalize to tasks from real-life scenarios because of domain mismatch. One such task is foreground speech detection from wearable audio devices. Interfering factors such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, make foreground speech detection challenging. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models using annotations available at a lower time resolution (coarse labels). We show how MIL can be applied to localize foreground speech in coarsely labeled audio and report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to densely distributed events, as observed in our application. Finally, we show improvements from using speech activity detection embeddings as features for foreground detection.
dc.eprint.version: Final published version
dc.identifier.citation: Hebbar R, Papadopoulos P, Reyes R, et al. Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices. EURASIP J Audio Speech Music Process. 2021;2021(1):7. doi:10.1186/s13636-020-00194-0
dc.identifier.uri: https://hdl.handle.net/1805/29057
dc.language.iso: en_US
dc.publisher: Springer
dc.relation.isversionof: 10.1186/s13636-020-00194-0
dc.relation.journal: EURASIP Journal on Audio, Speech, and Music Processing
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0
dc.source: PMC
dc.subject: Foreground speech detection
dc.subject: Multiple instance learning
dc.subject: Wearable audio
dc.subject: Weakly labeled audio
dc.title: Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices
dc.type: Article
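The record does not say which pooling functions the paper compares or how they were adapted to dense events. As a rough illustration of what MIL pooling means in this setting, the following sketch shows four pooling functions that are common in weakly labeled audio work (max, mean, linear softmax, and exponential softmax). Each maps per-frame (instance-level) foreground-speech probabilities for one coarsely labeled segment to a single segment-level (bag-level) score. The function names, the example probabilities, and the 0.5 localization threshold are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence

# Hypothetical sketch: common MIL pooling functions over per-frame
# probabilities p_i in [0, 1]; not the paper's actual method.

def max_pool(p: Sequence[float]) -> float:
    # Bag score is the single most confident frame.
    return max(p)

def mean_pool(p: Sequence[float]) -> float:
    # Every frame contributes equally to the bag score.
    return sum(p) / len(p)

def linear_softmax_pool(p: Sequence[float]) -> float:
    # Each frame is weighted by its own probability: sum(p_i^2) / sum(p_i).
    total = sum(p)
    return sum(x * x for x in p) / total if total > 0 else 0.0

def exp_softmax_pool(p: Sequence[float]) -> float:
    # Each frame is weighted by exp(p_i): sum(p_i * e^p_i) / sum(e^p_i).
    weights = [math.exp(x) for x in p]
    return sum(x * w for x, w in zip(p, weights)) / sum(weights)

def localize(p: Sequence[float], threshold: float = 0.5) -> List[bool]:
    # Instance-level localization: mark frames above an assumed threshold.
    return [x >= threshold for x in p]

if __name__ == "__main__":
    frames = [0.9, 0.8, 0.1, 0.7]  # made-up per-frame probabilities
    for pool in (max_pool, mean_pool, linear_softmax_pool, exp_softmax_pool):
        print(pool.__name__, round(pool(frames), 4))
    print(localize(frames))
```

With max pooling, only the highest-scoring frame determines the bag score (and receives gradient during training), which suits sparse events. Averaging-style pools such as mean or softmax pooling let every frame contribute, which is one reason they are often considered when, as in this application, the target events are densely distributed.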
Files
Original bundle
Name: 13636_2020_Article_194.pdf
Size: 1.43 MB
Format: Adobe Portable Document Format