MSU AVIS Dataset

Available for download here.

The MSU-AVIS dataset consists of audio-visual data of subjects speaking freely (text-independent) while walking around in a semi-constrained indoor environment that mimics a real-world surveillance scenario. The data was collected from 50 subjects, 16 of whom are female. For each subject, audio-visual data was collected across 3 different sessions at varying distances from the camera and microphone. Each session contains 12 video samples (RGB frames with single-channel audio), each 5 seconds long. The video frames in each sample are captured at a resolution of 1920 × 1080, and the corresponding audio is recorded at a sampling rate of 48 kHz. The face images exhibit variations due to large stand-off distance from the camera, occlusions, pose, indoor illumination, expressions, accessories, etc. The audio samples exhibit variations due to the distance of the subject from the microphone, indoor reverberation, background noise, etc. The dataset also includes a gallery set of high-quality face images and voice audio (text-dependent) from the 50 subjects. Additionally, an auxiliary dataset was collected from a subset of 10 subjects, with a focus on mimicking the biometric recognition challenges specific to surveillance scenarios, such as large pose variations and large stand-off distance from the camera and microphone.
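As a rough guide to the size of the main collection, the figures above can be multiplied out. The sketch below is only a nominal estimate: it assumes every subject completed all 3 sessions with all 12 samples, which the description does not state explicitly, and it excludes the gallery and auxiliary sets. The constant and function names are illustrative and are not part of the dataset.

```python
# Figures taken from the dataset description above.
NUM_SUBJECTS = 50
SESSIONS_PER_SUBJECT = 3
VIDEOS_PER_SESSION = 12
CLIP_SECONDS = 5
AUDIO_SAMPLE_RATE_HZ = 48_000  # 48 kHz, single channel


def nominal_totals():
    """Return nominal counts, assuming no missing sessions or samples."""
    clips_per_subject = SESSIONS_PER_SUBJECT * VIDEOS_PER_SESSION
    total_clips = NUM_SUBJECTS * clips_per_subject
    total_seconds = total_clips * CLIP_SECONDS
    return {
        "clips_per_subject": clips_per_subject,
        "total_clips": total_clips,
        "total_seconds": total_seconds,
        "total_hours": total_seconds / 3600,
        "audio_samples_per_clip": CLIP_SECONDS * AUDIO_SAMPLE_RATE_HZ,
    }


if __name__ == "__main__":
    for name, value in nominal_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions the main set comes to 36 samples per subject, 1,800 samples in total, and 2.5 hours of recording, with 240,000 audio samples in each 5-second clip.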

A. Chowdhury, Y. Atoum, L. Tran, X. Liu, A. Ross, "MSU-AVIS dataset: Fusing Face and Voice Modalities for Biometric Recognition in Indoor Surveillance Videos," Proc. of the 24th IAPR International Conference on Pattern Recognition (ICPR), (Beijing, China), August 2018.
