Ozan Caglayan

Dr, Senior Research Engineer

Imperial College London

contex.ai

Biography

I am a computer scientist from Turkey, working on problems related to natural language processing. I received my PhD in 2019 from the Informatics Laboratory (LIUM) of Le Mans University. During summer 2018, I visited Johns Hopkins University as a graduate student in the Grounded Sequence to Sequence Transduction Team, as part of the JSALT workshop.

In 2022, I worked as a senior research engineer at CONTEX.AI, a start-up tackling the immense problem of online toxic content moderation by leveraging contextual and multimodal machine learning techniques. The company was acquired by Epic Games in 2023, and since then I have been working as a Technical Lead within the Machine Learning Solutions team at Epic Games UK.

Between 2012 and 2017, I worked as a research & teaching assistant in the Computer Engineering Dept. of Galatasaray University in Istanbul, Turkey, where I taught practical sessions for undergraduate courses such as Algorithms, C Programming, Microprocessors and Operating Systems. I also obtained my MSc. degree from Galatasaray University, where I developed a simple embedded system for a brain-computer interface application. You can access my MSc. thesis here.

Between 2007 and 2012, I took part in the Pardus Linux project within the Scientific & Technological Research Council of Turkey (TUBITAK) as a full-time Linux developer.

Interests

Deep Neural Nets
Natural Language Processing
Speech Processing
Linguistics
Open-Source & Linux
DIY Electronics & Maker
Photography
Woodworking

Education

PhD in Machine Translation, 2019
Le Mans University
MSc. Computer Engineering, 2014
Galatasaray University, Turkey
BSc. Computer Engineering, 2008
Galatasaray University, Turkey
Erasmus Exchange Programme, 2007
Université Joseph Fourier, France
High School, 2004
Lycée Français Saint Joseph, Istanbul

Experience

Senior Research Engineer & Technical Lead

CONTEX.AI (Epic Games since 2023)

Jan 2022 – Present London, United Kingdom

Research and engineering of machine learning models for content moderation, DevOps, MLOps, Infrastructure Engineering.

Research Associate

Imperial College London

Jul 2019 – Dec 2021 London, United Kingdom

Research on vision & language processing related topics such as image captioning, multimodal machine translation.
Teaching the language modelling and machine translation parts of the NLP course.
Co-supervision of MSc. students.

Visiting Researcher

Center for Language & Speech Processing, Johns Hopkins University

Jun 2018 – Aug 2018 Baltimore, USA

Participated in the Fifth Frederick Jelinek Memorial Summer Workshop (JSALT)

PhD Student, Researcher

LIUM - Le Mans Université

Oct 2017 – Jun 2019 Le Mans, France

Teaching & Research Assistant

Computer Engineering Dept. / Galatasaray University

Oct 2012 – Mar 2017 Istanbul, Turkey

Academic research contributing towards the PhD topic
MSc. related research projects on Brain-computer Interfaces
Teaching assistant for Parallel computing, C programming, Algorithms, Operating Systems, Microprocessors.

Researcher

Scientific and Technological Research Council of Turkey (TUBITAK)

Sep 2007 – Jan 2012 Gebze, Kocaeli

Open-source Linux developer under the Pardus project

Intern at Pardus Project

Scientific and Technological Research Council of Turkey (TUBITAK)

Jun 2007 – Aug 2007 Gebze, Kocaeli

Intern at Optoelectronics Lab

Scientific and Technological Research Council of Turkey (TUBITAK)

Jun 2006 – Jul 2006 Gebze, Kocaeli

Featured Publications

Faidon Mitzalis, Ozan Caglayan, Pranava Madhyastha, Lucia Specia

August 2021 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP): Main Volume

BERTGen: Multi-task Generation through BERT

We present BERTGen, a novel, generative, decoder-only model which extends BERT by fusing multimodal and multilingual pre-trained models VL-BERT and M-BERT, respectively. BERTGen is auto-regressively trained for language generation tasks, namely image captioning, machine translation and multimodal machine translation, under a multi-task setting. With a comprehensive set of evaluations, we show that BERTGen outperforms many strong baselines across the tasks explored. We also show BERTGen’s ability for zero-shot language generation, where it exhibits competitive performance to supervised counterparts. Finally, we conduct ablation studies which demonstrate that BERTGen substantially benefits from multi-tasking and effectively transfers relevant inductive biases from the pre-trained models.

Ozan Caglayan, Menekse Kuyu, Mustafa Sercan Amac, Pranava Madhyastha, Erkut Erdem, Aykut Erdem, Lucia Specia

April 2021 Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Cross-lingual Visual Pre-training for Multimodal Machine Translation

Pre-trained language models have been shown to improve performance in many natural language tasks substantially. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend the translation language modelling (Lample and Conneau, 2019) with masked region classification and perform pre-training with three-way parallel vision & language corpora. We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.

Ozan Caglayan, Pranava Madhyastha, Lucia Specia

December 2020 Proceedings of the 28th International Conference on Computational Linguistics

Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community for more careful consideration of how they automatically evaluate their models by demonstrating important failure cases on multiple datasets, language pairs and tasks. Our experiments show that metrics (i) usually prefer system outputs to human-authored texts, (ii) can be insensitive to correct translations of rare words, (iii) can yield surprisingly high scores when given a single sentence as system output for the entire test set.

Ozan Caglayan, Julia Ive, Veneta Haralampieva, Pranava Madhyastha, Loïc Barrault, Lucia Specia

November 2020 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Simultaneous Machine Translation with Visual Context

Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. The translation thus has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this paper, we seek to understand whether the addition of visual information can compensate for the missing source context. To this end, we analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks. Our results show that visual context is helpful and that visually-grounded models based on explicit object region information are much better than commonly used global features, reaching up to 3 BLEU points improvement under low latency scenarios. Our qualitative analysis illustrates cases where only the multimodal systems are able to translate correctly from English into gender-marked languages, as well as deal with differences in word order, such as adjective-noun placement between English and French.

Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann

January 2020 Machine Translation

Multimodal machine translation through visuals and speech

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.

Ozan Caglayan, Pranava Madhyastha, Lucia Specia, Loïc Barrault

June 2019 Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Probing the Need for Visual Context in Multimodal Machine Translation

Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis where we partially deprive the models from source-side textual context. Our results show that under limited textual context, models are capable of leveraging the visual input to generate better translations. This contradicts the current belief that MMT models disregard the visual modality because of either the quality of the image features or the way they are integrated into the model.

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze

December 2018 Proceedings of the Workshop on Visually Grounded Interaction and Language (NeurIPS 2018)

How2: A Large-scale Dataset For Multimodal Language Understanding

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing.

Recent Publications


Exploiting Multimodal Reinforcement Learning for Simultaneous Machine Translation

This paper addresses the problem of simultaneous machine translation (SiMT) by exploring two main concepts: (a) adaptive policies to …

Julia Ive, Andy Mingren Li, Yishu Miao, Ozan Caglayan, Pranava Madhyastha, Lucia Specia

Grounded Sequence to Sequence Transduction

Speech recognition and machine translation have made major progress over the past decades, providing practical systems to map one …

L. Specia, L. Barrault, O. Caglayan, A. Duarte, D. Elliott, S. Gella, N. Holzenberger, C. Lala, S. J. Lee, J. Libovicky, P. Madhyastha, F. Metze, K. Mulligan, A. Ostapenko, S. Palaskar, R. Sanabria, J. Wang, R. Arora

Multimodal Grounding for Sequence-to-sequence Speech Recognition

Humans are capable of processing speech by making use of multiple sensory modalities. For example, the environment where a conversation …

O. Caglayan, R. Sanabria, S. Palaskar, L. Barrault, F. Metze

Transformer-based Cascaded Multimodal Speech Translation

This paper describes the cascaded multimodal speech translation systems developed by Imperial College London for the IWSLT 2019 …

Zixiu Wu, Ozan Caglayan, Julia Ive, Josiah Wang, Lucia Specia

LIUM-CVC Submissions for WMT18 Multimodal Translation Task

This paper describes the multimodal Neural Machine Translation systems developed by LIUM and CVC for WMT18 Shared Task on Multimodal …

Ozan Caglayan, Adrien Bardet, Fethi Bougares, Loïc Barrault, Kai Wang, Marc Masana, Luis Herranz, Joost van de Weijer

Sustainable Computational Science: The ReScience Initiative

Computer science offers a large set of tools for prototyping, writing, running, testing, validating, sharing and reproducing results; …

Nicolas P Rougier, Konrad Hinsen, Frédéric Alexandre, Thomas Arildsen, Lorena A Barba, Fabien CY Benureau, C Titus Brown, Pierre De Buyl, Ozan Caglayan, Andrew P Davison, et al.

LIUM Machine Translation Systems for WMT17 News Translation Task

This paper describes LIUM submissions to WMT17 News Translation Task for English-German, English-Turkish, English-Czech and …

Mercedes García-Martínez, Ozan Caglayan, Walid Aransa, Adrien Bardet, Fethi Bougares, Loïc Barrault

LIUM-CVC Submissions for WMT17 Multimodal Translation Task

This paper describes the monomodal and multimodal Neural Machine Translation systems developed by LIUM and CVC for WMT17 Shared Task on …

Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, Joost van de Weijer

NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems

In this paper, we present nmtpy, a flexible Python toolkit based on Theano for training Neural Machine Translation and other neural …

Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, Loïc Barrault

Multimodal Attention for Neural Machine Translation

The attention mechanism is an important part of the neural machine translation (NMT) where it was reported to produce richer source …

Ozan Caglayan, Loïc Barrault, Fethi Bougares