MSU-AVIS Dataset

Available for download here.

The MSU-AVIS dataset consists of audio-visual data of subjects speaking freely (text-independent) while walking around in a semi-constrained indoor environment that mimics a real-world surveillance scenario. The data were collected from 50 subjects, 16 of whom are female. For each subject, audio-visual data were collected across 3 different sessions at varying distances from the camera and microphone. Each session contains 12 video samples (RGB frames and single-channel audio), each 5 seconds long. The video frames in each sample are captured at a resolution of 1920 × 1080, and the corresponding audio is recorded at a sampling rate of 48 kHz. The face images exhibit variations due to large stand-off distance from the camera, occlusions, pose, indoor illumination, expressions, accessories, etc. The audio samples exhibit variations due to the distance of the subject from the microphone, indoor reverberation, background noise, etc. The dataset also includes a gallery set of high-quality face images and voice audio (text-dependent) from the 50 subjects. Additionally, an auxiliary dataset was collected from a subset of 10 subjects, with a focus on mimicking the biometric recognition challenges specific to surveillance scenarios, such as large pose variations and stand-off distance from the camera/microphone.
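
As a rough guide to the scale implied by the figures above, the short Python sketch below multiplies them out. It is only a back-of-the-envelope calculation: it assumes every subject has all 3 sessions with all 12 videos, which the actual release may not satisfy, and it excludes the gallery and auxiliary sets. The function name and dictionary keys are hypothetical and are not part of the dataset.

```python
# Nominal figures taken from the dataset description (surveillance videos only).
NUM_SUBJECTS = 50
SESSIONS_PER_SUBJECT = 3
VIDEOS_PER_SESSION = 12
CLIP_SECONDS = 5
AUDIO_SAMPLE_RATE_HZ = 48_000
FRAME_WIDTH, FRAME_HEIGHT = 1920, 1080


def nominal_dataset_stats():
    """Multiply out the stated figures (hypothetical helper).

    Assumes complete data for every subject and session.
    """
    videos_per_subject = SESSIONS_PER_SUBJECT * VIDEOS_PER_SESSION
    total_videos = NUM_SUBJECTS * videos_per_subject
    total_seconds = total_videos * CLIP_SECONDS
    return {
        "videos_per_subject": videos_per_subject,
        "total_videos": total_videos,
        "total_hours": total_seconds / 3600,
        "audio_samples_per_video": AUDIO_SAMPLE_RATE_HZ * CLIP_SECONDS,
        "pixels_per_frame": FRAME_WIDTH * FRAME_HEIGHT,
    }


stats = nominal_dataset_stats()
for name, value in stats.items():
    print(f"{name}: {value}")
```

Under these assumptions, each subject contributes 36 videos, for 1,800 videos (2.5 hours) in total, and each 5-second video carries 240,000 single-channel audio samples.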

A. Chowdhury, Y. Atoum, L. Tran, X. Liu, A. Ross, "MSU-AVIS dataset: Fusing Face and Voice Modalities for Biometric Recognition in Indoor Surveillance Videos," Proc. of the 24th IAPR International Conference on Pattern Recognition (ICPR), (Beijing, China), August 2018.