Large-scale learning from multimodal videos
Abstract: In this talk, we present recent progress on large-scale learning of multimodal video representations. We start by presenting VideoBERT, a joint model for video and language that repurposes the BERT model for multimodal data. This model achieves state-of-the-art results on zero-shot prediction and video captioning. Next, we present an approach for video question answering which relies on cross-modal supervision with a textual question answering module. We show state-of-the-art results for video question answering without any supervision (zero-shot VQA). We then present the recent VideoCC dataset, which transfers image captions to video and allows obtaining state-of-the-art performance for zero-shot video and audio retrieval and video captioning. Next, we present a model for audio-visual automatic speech recognition (AV-ASR) and conclude the presentation with recent work on navigation and robot manipulation given language instructions.
Bio: Cordelia Schmid holds an M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate in Computer Science from the Institut National Polytechnique de Grenoble (INPG). Her doctoral thesis on "Local Greyvalue Invariants for Image Matching and Retrieval" received the best thesis award from INPG in 1996. She received the Habilitation degree in 2001 for her thesis entitled "From Image Matching to Learning Visual Models". Dr. Schmid was a post-doctoral research assistant in the Robotics Research Group of Oxford University in 1996--1997. Since 1997 she has held a permanent research position at Inria, where she is a research director. Dr. Schmid is a member of the German National Academy of Sciences, Leopoldina and a fellow of IEEE and the ELLIS society. She was awarded the Longuet-Higgins prize in 2006, 2014 and 2016 and the Koenderink prize in 2018, both for fundamental contributions in computer vision that have withstood the test of time. She received an ERC advanced grant in 2013, the Humboldt research award in 2015, the Inria & French Academy of Science Grand Prix in 2016, the Royal Society Milner award in 2020 and the PAMI distinguished researcher award in 2021. Dr. Schmid has been an Associate Editor for IEEE PAMI (2001--2005) and for IJCV (2004--2012), an editor-in-chief for IJCV (2013--2018), a program chair of IEEE CVPR 2005 and ECCV 2012 as well as a general chair of IEEE CVPR 2015, ECCV 2020 and ICCV 2023. Since 2018 she has held a joint appointment with Google Research.
Cornell University, USA
Visual Appearance and Discovery from Micron to Planet Scale
Abstract: We can "see" the world from micron resolution in CT images to planet scale with satellite imagery. The availability of visual appearance data at this range of scales is unprecedented, with more modalities of data and scale of imagery than ever before. This data presents a unique opportunity to understand the visual appearance of objects in scenes at one end of the spectrum, and to collectively discover planet-scale events at the other end of the spectrum.
In this talk, I will describe my group's research on better visual understanding and discovery, including graphics models for realistic visual appearance and rendering, reconstruction of shape and materials, and unsupervised recognition for visual discovery of planet-scale trends across geography and time. Visual intelligence applications range from commerce to sustainability.
Bio: Kavita Bala is the inaugural dean of the Cornell Ann S. Bowers College of Computing and Information Science at Cornell University. Bala received her S.M. and Ph.D. from the Massachusetts Institute of Technology (MIT). Bala leads research in computer vision and computer graphics in visual discovery, recognition and search; material modeling and acquisition using physics and learning; physically based scalable rendering; and perception. Bala is the recipient of the SIGGRAPH Computer Graphics Achievement Award (2020), and she is an Association for Computing Machinery (ACM) Fellow (2019) and Fellow of the SIGGRAPH Academy (2020). Bala has served as the Editor-in-Chief of Transactions on Graphics (TOG). She co-founded GrokStyle, a visual recognition AI company, which drew IKEA as a client, and was acquired by Facebook in 2019. Bala has received multiple teaching awards and has authored the graduate-level textbook "Advanced Global Illumination".
Cornell Tech, USA
Representations and Geometry for Multimodal Learning
Abstract: The advent of deep neural networks has brought significant advancements in the development and deployment of novel AI technologies. Recent large-scale neural network architectures have shown significantly better performance for object classification, segmentation, scene understanding and multimodal representations. Samsung Research has focused on incorporating these neural network models across Samsung's billions of devices and users. But how can we understand how the representations of sensor input signals are transformed by deep neural networks? I will show how insights can be gained by analyzing the high-dimensional geometrical structure of these representations as they are reformatted in neural network hierarchies.
Bio: Dr. Daniel Dongyuel Lee is the Tisch University Professor in Electrical and Computer Engineering at Cornell Tech and Executive Vice President and Head of the Global AI Center for Samsung Research. He received his B.A. summa cum laude in Physics from Harvard University and his Ph.D. in Condensed Matter Physics from the Massachusetts Institute of Technology. He was also a researcher at Bell Labs in the Theoretical Physics and Biological Computation departments. He is a Fellow of the IEEE and AAAI and has received the NSF CAREER award and the Lindback award for distinguished teaching. He was also a fellow of the Hebrew University Institute of Advanced Studies in Jerusalem and an affiliate of the Korea Advanced Institute of Science and Technology, and he organized the US-Japan National Academy of Engineering Frontiers of Engineering symposium and the Neural Information Processing Systems (NeurIPS) conference. His group focuses on understanding general computational principles in biological systems and on applying that knowledge to build autonomous systems.
Nanjing University, China
Open-environment machine learning
Abstract: With the great success of machine learning, more and more practical tasks presented to the community now involve open-environment scenarios, where important factors are subject to change. The problem becomes even more challenging because data usually accumulate over time, like streams, so it is hard to train the machine learning model only after collecting all the data, as in conventional studies. This talk will briefly introduce some advances in this line of research.
Bio: Zhi-Hua Zhou is Professor of Computer Science and Artificial Intelligence at Nanjing University. His research interests are mainly in machine learning and data mining, with significant contributions to ensemble learning, multi-label and weakly supervised learning, etc. He has authored the books "Ensemble Methods: Foundations and Algorithms", "Machine Learning", etc., and published more than 200 papers in top-tier journals or conferences. Many of his inventions have been successfully transferred to industry. He founded ACML (Asian Conference on Machine Learning), served as Program Chair for AAAI-19, IJCAI-21, etc., General Chair for ICDM'16, SDM'22, etc., and Senior Area Chair for NeurIPS and ICML. He is series editor of Springer LNAI, on the advisory board of AI Magazine, and serves as editor-in-chief of Frontiers of Computer Science, associate editor of AIJ, MLJ, IEEE TPAMI, ACM TKDD, etc. He is a Fellow of the ACM, AAAI, AAAS, and IEEE, and a recipient of the National Natural Science Award of China, the IEEE Computer Society Edward J. McCluskey Technical Achievement Award, the CCF-ACM Artificial Intelligence Award, etc.