Programs

Keynote Speech

Professor Kiyoharu Aizawa, University of Tokyo

Title: Building a Realistic Virtual World from 360˚ Videos for Large Scale Urban Exploration

Abstract: In this talk, we present a series of works, MovieMap, 360RVW, and 360xCityGML RVW, aimed at constructing realistic virtual environments using 360-degree video technology. Visual exploration of urban areas is crucial for many applications, and for this we mostly rely on tools such as Google Street View (GSV). While GSV is widely used, its sparse street images can lead to user confusion and decreased satisfaction.

Forty years ago, the video-based interactive map known as the Aspen Movie Map was created. Unlike GSV, it used continuous movie footage. However, the Aspen Movie Map was built with film cameras and analog technology, required significant manual effort, and has never been replicated.

We have modernized this concept using the latest technology to create a new MovieMap. The system enables interactive exploration of cities by connecting segments of original 360-degree walking videos captured along streets, which are analyzed and segmented at intersections. Users can explore the area interactively by changing their direction of movement at each intersection.
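To make the idea concrete, here is a minimal sketch (not the authors' implementation) of how such a MovieMap could be organized: intersections become graph nodes, each 360-degree clip becomes a directed edge with a travel heading, and navigation reduces to picking the outgoing clip whose heading best matches the direction the user chooses. All names and fields are hypothetical.

```python
# Sketch of a MovieMap-style video graph: intersections as nodes,
# 360-degree clips as directed edges annotated with a travel heading.
import math
from dataclasses import dataclass, field

@dataclass
class Segment:
    video_file: str        # hypothetical path to a 360-degree clip
    start_node: str        # intersection where the clip begins
    end_node: str          # intersection where the clip ends
    heading_deg: float     # compass heading of travel along the clip

@dataclass
class MovieMapGraph:
    segments: list = field(default_factory=list)

    def outgoing(self, node: str):
        return [s for s in self.segments if s.start_node == node]

    def next_segment(self, node: str, desired_heading_deg: float):
        """Pick the outgoing clip whose heading is closest to the user's choice."""
        def angular_diff(a, b):
            return abs((a - b + 180.0) % 360.0 - 180.0)
        candidates = self.outgoing(node)
        if not candidates:
            return None
        return min(candidates, key=lambda s: angular_diff(s.heading_deg, desired_heading_deg))

# Usage: two clips leave intersection "A"; the user turns roughly east (90 degrees).
graph = MovieMapGraph([
    Segment("clip_north.mp4", "A", "B", heading_deg=0.0),
    Segment("clip_east.mp4", "A", "C", heading_deg=90.0),
])
print(graph.next_segment("A", desired_heading_deg=80.0).video_file)  # -> clip_east.mp4
```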

We extended MovieMap into a virtual world called 360RVW, in which users explore the world with their avatars. They control their avatars to move around the area and interact with other avatars and objects in the space. The world is a projection onto a local spherical surface, with annotations defining road regions within which avatars can move freely.
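As a minimal sketch of the road-region constraint described above, assuming walkable areas are annotated as 2D polygons on the ground plane, an avatar move could be accepted only if its target point lies inside one of those polygons. The region and coordinates below are hypothetical.

```python
# Sketch: accept an avatar move only if it lands inside an annotated road polygon.
def point_in_polygon(x, y, polygon):
    """Standard ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > y) != (yj > y):
            x_cross = (xj - xi) * (y - yi) / (yj - yi) + xi
            if x < x_cross:
                inside = not inside
        j = i
    return inside

# Hypothetical road region (a simple street strip) and two proposed avatar steps.
road_region = [(0, 0), (10, 0), (10, 2), (0, 2)]
print(point_in_polygon(5, 1, road_region))   # True  -> move allowed
print(point_in_polygon(5, 5, road_region))   # False -> move blocked
```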

Further enhancing our system, we incorporate a CityGML model, such as PLATEAU, which provides 3D volume models of buildings and roads without details such as textures and objects. By optimizing the registration between our 360-degree videos and the CityGML model, we directly texture-map the videos onto the CityGML model. This integration introduces global 3D spatial features into 360RVW, enabling advanced visualizations such as flood scenario simulations.
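The following is a minimal sketch, not the authors' pipeline, of the geometric core of such texture mapping: assuming the 360-degree frame is equirectangular and the camera pose (rotation R, position) has already been registered to the CityGML model, a mesh vertex can be mapped to (u, v) texture coordinates by converting its camera-space direction to longitude and latitude. The pose, vertex, and frame size are made up for illustration.

```python
# Sketch: project a 3D CityGML vertex into an equirectangular 360-degree frame.
import math
import numpy as np

def equirect_uv(vertex_world, R, cam_pos, width, height):
    """Map a world-space vertex to pixel coordinates in an equirectangular frame."""
    d = R @ (np.asarray(vertex_world, dtype=float) - np.asarray(cam_pos, dtype=float))
    d = d / np.linalg.norm(d)             # unit viewing direction in camera space
    lon = math.atan2(d[0], d[2])          # -pi..pi around the vertical axis
    lat = math.asin(d[1])                 # -pi/2..pi/2 above/below the horizon
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v

# Hypothetical example: identity rotation, camera at the origin, 4K frame.
print(equirect_uv([3.0, 1.0, 5.0], np.eye(3), [0.0, 0.0, 0.0], 3840, 1920))
```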

Through these developments, we aim to demonstrate the effectiveness of 360-degree videos in creating realistic virtual worlds, bypassing the complex modeling required in CG-based virtual environments.

Bio: Kiyoharu Aizawa is a professor in the Department of Information and Communication Engineering and Director of the VR Center at the University of Tokyo. He was a visiting assistant professor at the University of Illinois from 1990 to 1992. His research fields are multimedia, image processing, and computer vision, with a particular interest in interdisciplinary and cross-disciplinary issues. He received the 1990 and 1998 Best Paper Awards, the 1991 Achievement Award, and the 1999 Electronics Society Award from IEICE Japan, as well as the 1998 Fujio Frontier Award, the 2002 and 2009 Best Paper Awards, and the 2013 Achievement Award from ITE Japan. He received the IBM Japan Science Prize in 2002. He is on the Editorial Board of ACM TOMM, served as Editor-in-Chief of the Journal of ITE Japan, and was an Associate Editor of IEEE TIP, IEEE TCSVT, IEEE TMM, and IEEE MultiMedia. He has served in numerous international and domestic conferences; he was a General Co-Chair of ACM Multimedia 2012 and ICMR 2018. He is a Fellow of IEEE, IEICE, and ITE, and a member of the Science Council of Japan.

Dr. Elena Alshina, Huawei Technologies

Title: AI coding reality and perspectives

Abstract: Over the last decade, the number of research papers devoted to image and video coding has grown exponentially. JPEG has almost completed its standardization project for an end-to-end learnable image codec. Do we have a chance to see JPEG AI in products in the near future? What are the challenges on the way to market? End-to-end AI video codec research is progressing rapidly, and JVET is combining AI coding tools with a classical codec. What can AI do for video coding that a classical codec cannot? Computational complexity and power consumption restrictions for video encoding and decoding are much stricter than for image coding. Do we have a chance to see AI video coding tools, or even an end-to-end AI codec, in products within, say, the next ten years?

Bio: Dr. Elena Alshina is Chief Video Scientist, Audiovisual Technology Lab Director, and Media Codec Technology Lab Director at Huawei Technologies. She received her Master's degree in Physics from Moscow State University in 1995 and her PhD in Computer Science and Mathematical Modelling from the Russian Academy of Sciences in 1998. Her fields of interest include, but are not limited to, mathematical modelling, video and image compression and processing, and the development of neural network-based algorithms. Throughout her career, Elena has worked in both academia and industry: Senior Researcher at the Russian Academy of Sciences (Institute for Mathematical Modelling), 1998–2006; Associate Professor at the National Research University of Electronic Technology; Principal Engineer at Samsung Electronics, 2006 (Moscow) and 2007–2018 (Suwon/Seoul); and Chief Video Scientist at Huawei Technologies (Munich), 2018–present. Elena is an active participant in video codec standardization (the HEVC/H.265 and VVC/H.266 projects). Together with Prof. João Ascenso, she chairs the JPEG AI standardization project and serves as its editor.

Professor David Bull, University of Bristol

Title: Learning-based Video Compression: from TV to the Metaverse

Abstract: After providing context on the role and importance of compression in existing and emerging media delivery use cases, Professor Bull will introduce recent innovations that exploit deep learning to optimise video coding systems. He will first outline the requirements and constraints imposed on practical learning-based codecs and the pros and cons of existing architectures. He will also discuss the importance of diverse and sufficient datasets for training, validating and testing these methods. Differentiating between learning-enabled tools (which enhance conventional hybrid codec architectures) and end-to-end learnt codecs (which employ radically new approaches to codec implementation), he will compare these in terms of algorithmic structure, candidate approaches, rate-quality performance and complexity. The talk will specifically discuss how new modelling-based approaches that exploit Implicit Neural Representations offer potential for both conventional and immersive formats. The lecture will conclude with a review of challenges that remain open for future research.
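For readers unfamiliar with Implicit Neural Representations for video, here is a minimal sketch, not Professor Bull's codec, of the core idea: a small network is overfit to one clip so that f(x, y, t) maps a pixel coordinate and time to an RGB colour, and the compressed representation is essentially the (quantized) network weights. The architecture, sizes, and random training data below are placeholders.

```python
# Sketch of an implicit neural representation for video: coordinates -> colour.
import torch
import torch.nn as nn

class VideoINR(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, coords):                    # coords: (N, 3) = (x, y, t)
        return self.net(coords)

# Toy training loop; random tensors stand in for sampled video pixels.
model = VideoINR()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
coords = torch.rand(1024, 3)          # normalized (x, y, t) sample positions
colors = torch.rand(1024, 3)          # ground-truth RGB at those positions
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(coords), colors)
    loss.backward()
    optimizer.step()
# Decoding amounts to evaluating the (quantized) model at every pixel coordinate.
```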

Bio: David Bull holds the Chair in Signal Processing at the University of Bristol. He is Director of Bristol Vision Institute and leads the MyWorld Strength in Places Programme, a £30m UKRI investment in creative technology R&D based in the West of England. David has worked widely in the fields of visual computing and visual communications, focusing on media processing challenges relating to compression, enhancement and quality assessment. Much of his research in recent years has exploited machine learning methods to deliver optimized solutions to these challenges. He has published over 600 academic papers and five books on these topics and has won numerous awards for his work. David has delivered several invited/keynote lectures and has organised several leading conferences, most recently acting as General Chair for the 2021 IEEE Picture Coding Symposium. He has also acted as an advisor to major organisations and government agencies across the globe on research strategy, innovative technologies and research policy. David's work is often exploited commercially; in 2001 he co-founded ProVision Communications, acting as its Director and Chairman until its acquisition in 2011. David is a chartered engineer, a Fellow of the IET and a Fellow of the IEEE.

Dr. Ross Cutler, Microsoft Corporation

Title: Photorealistic avatars for video conferencing

Abstract: Video conferencing has become widely used, enabling millions of people to connect every day for work or personal communication. However, video conferencing as currently implemented does not yet match the experience of face-to-face meetings. We will first describe the fundamental issues of video conferencing systems and how their implementation affects meeting effectiveness, inclusiveness, trust, empathy, participation, and fatigue. We will then show how a new type of video conferencing experience using photorealistic avatars can address these issues, ultimately providing remote meetings that are as good as or better than face-to-face meetings. We describe a photorealistic avatar test framework that measures avatar realism, trust, comfort in using and interacting with avatars, formality, and appropriateness for work. We share how the latest avatar technologies rank on these metrics and give insights into what is needed to fully replace 2D video for video conferencing.

Bio: Ross Cutler is a Distinguished Engineer / VP at Microsoft, where he manages a team of applied scientists and software engineers who build AI technology to improve the audio/video performance and meeting capabilities of Teams and Skype. He has been with Microsoft since 2000, joining as a researcher in Microsoft Research, but has spent most of his time in the product divisions shipping real-time communication software and hardware used by 300M+ customers. He has published more than 80 academic papers and holds more than 100 granted patents in the areas of computer vision, speech enhancement, machine learning, optics, and acoustics. Ross received his Ph.D. in Computer Science (2000) in the area of computer vision from the University of Maryland, College Park.
