Quentin Fournier


Email | LinkedIn | GitHub

I am a Research Fellow at Mila - Quebec Artificial Intelligence Institute.

My current research investigates Transformer language models for drug discovery. Additionally, I am passionate about teaching and have taught the graduate Data Mining course at Polytechnique Montréal.

I received my Ph.D. from Polytechnique Montréal, working with Daniel Aloise and Michel R. Dagenais on the use of Transformers to detect anomalies in Linux kernel traces.


Research Fellow at Mila

Lead a thematic lab on ML for drug discovery.

Consultant as Senior Machine Learning Scientist at Amgen

Design novel ML solutions for drug discovery.

Postdoctoral Fellow at Mila, University of Montréal

Advised by Sarath Chandar and Irina Rish.

Computer Science and Operations Research Department.

Lecturer at Polytechnique Montréal

INF8111 – Data Mining (Fall 2019, Summer 2020, Fall 2020, Fall 2022) (graduate-level)


This graduate-level course is a comprehensive introduction to data mining that covers data munging, machine learning algorithms, mining of graphs and streams, and big data.

Teaching Assistant at Polytechnique Montréal

INF8111 – Data Mining (Summer 2021, Fall 2021) (graduate-level)



INF8215 – Artificial Intelligence: Methods and Algorithms (Fall 2018) (graduate-level)


This graduate-level course is a comprehensive introduction to artificial intelligence that covers local search, A*, constraint satisfaction problems, supervised and unsupervised learning, as well as reinforcement learning.

Research and Development Internship at IT Link

Supervised by Nicolas Ménard and Christian Raymond

Research Internship at Institut de Recherche en Informatique et Système Aléatoire (IRISA)

Supervised by Christian Raymond

Publications and Research Projects

Predicting the Impact of Model Expansion through the Minima Manifold: A Loss Landscape Perspective

Pranshu Malviya, Jerry Huang, Quentin Fournier, and Sarath Chandar


The optimal model for a given task is often challenging to determine, requiring training multiple models from scratch, which becomes prohibitive as dataset and model sizes grow. A more efficient alternative is to reuse smaller pre-trained models by expanding them; however, this is not widely adopted because how expansion impacts training dynamics remains poorly understood. While prior works have introduced statistics to measure these effects, they remain flawed. To rectify this, we offer a new approach for understanding and quantifying the impact of expansion through the lens of the loss landscape, which has been shown to contain a manifold of linearly connected minima. Building on this new perspective, we propose a metric to study the impact of expansion by estimating the size of the manifold. Experimental results show a clear relationship between gains in performance and manifold size, enabling the comparison of candidate models and presenting a first step towards expanding models more reliably based on geometric properties of the loss landscape.

A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques

Megh Thakkar, Quentin Fournier, Matthew Riemer, Pin-Yu Chen, Amal Zouaq, Payel Das, and Sarath Chandar

Large language models are first pre-trained on trillions of tokens and then instruction-tuned or aligned to specific preferences. While pre-training remains out of reach for most researchers due to the compute required, fine-tuning has become affordable thanks to parameter-efficient methods such as LoRA and QLoRA. Alignment is known to be sensitive to the many factors involved, including the quantity and quality of data, the alignment method, and the adapter rank. However, there has not yet been an extensive study of their effect on downstream performance. To address this gap, we conduct an in-depth investigation of the impact of popular choices for three crucial axes: (i) the alignment dataset (HH-RLHF and BeaverTails), (ii) the alignment technique (SFT and DPO), and (iii) the model (LLaMA-1, Vicuna-v1.3, Mistral-7b, and Mistral-7b-Instruct). Our extensive setup spanning over 300 experiments reveals consistent trends and unexpected findings. We observe how more informative data helps with preference alignment, cases where supervised fine-tuning outperforms preference optimization, and how aligning to a distinct preference boosts performance on downstream tasks. Through our in-depth analyses, we put forward key guidelines to help researchers perform more effective parameter-efficient LLM alignment.
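To give an intuition for why adapter methods such as LoRA make alignment affordable, the full weight update is replaced by a product of two small matrices of rank r, so only a tiny fraction of the parameters is trained. A minimal numpy sketch; the layer size and rank below are illustrative, not values from the paper:

```python
import numpy as np

d_out, d_in, r = 4096, 4096, 8  # illustrative hidden size and adapter rank

# Frozen pre-trained weight: never updated during fine-tuning.
W = np.random.randn(d_out, d_in).astype(np.float32)

# Trainable low-rank adapter: B starts at zero so training begins exactly at W.
A = np.random.randn(r, d_in).astype(np.float32) * 0.01
B = np.zeros((d_out, r), dtype=np.float32)

def forward(x):
    # Adapted layer: W @ x + B @ (A @ x), equivalent to (W + B @ A) @ x.
    return W @ x + B @ (A @ x)

full_params = W.size              # parameters a full update would touch
adapter_params = A.size + B.size  # parameters LoRA actually trains
print(adapter_params / full_params)  # 0.00390625, i.e. ~0.4% of a full update
```

Raising the adapter rank r trades memory for capacity, which is one of the axes the study varies.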

A Practical Survey on Faster and Lighter Transformers

Quentin Fournier, Gaétan Marceau Caron, and Daniel Aloise


Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model solely based on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrarily long dependencies. The Transformer has improved the state-of-the-art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models' efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer's limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for the deep learning community to determine which methods to apply in practice to meet the desired trade-off between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make the Transformer faster and lighter and by providing a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions.
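The quadratic cost comes from the attention weight matrix, which stores one score per pair of positions. A minimal self-attention sketch in numpy; the sequence length and head dimension are illustrative:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n, n): one score per pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

n, d = 512, 64               # sequence length and head dimension
X = np.random.randn(n, d)
out, w = attention(X, X, X)  # self-attention: Q = K = V = X
# The (n, n) weight matrix is the source of the quadratic cost:
print(w.shape)  # (512, 512)
```

The lower-complexity alternatives surveyed avoid materializing this full n-by-n matrix, for instance by sparsifying it or approximating it with low-rank products.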

On Improving Deep Learning Trace Analysis With System Call Arguments

Quentin Fournier, Daniel Aloise, Seyed Vahid Azhari, and François Tetreault

View on GitHub arXiv

Kernel traces are sequences of low-level events comprising a name and multiple arguments including a timestamp, a process id, and a return value, depending on the event. Their analysis helps uncover intrusions, identify bugs, and find latency causes. However, existing analyses often omit the event arguments, which limits their effectiveness. To remedy this limitation, we introduce a general approach to learn a representation of the event names along with their arguments using both embedding and encoding. The proposed method is readily applicable to most neural networks and is task-agnostic. The benefit is quantified by conducting an ablation study on three groups of arguments: call-related, process-related, and time-related. Experiments were conducted on a novel web request dataset and validated on a second dataset collected on pre-production servers by Ciena, our partnering company. By leveraging additional information, we were able to increase the performance of two widely-used neural networks, an LSTM and a Transformer, by up to 11.3% on two unsupervised language modelling tasks. Such tasks may be used to detect anomalies, pre-train neural networks to improve their performance, and extract a contextual representation of the events.
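The core idea, a learned embedding for the event name concatenated with an encoding of its arguments, can be sketched as follows. The vocabulary, argument set, and dimensions below are hypothetical stand-ins, not those of the paper, and the tables would be learned rather than random in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical event vocabulary and name-embedding table (learned in practice).
vocab = {"sys_read": 0, "sys_write": 1, "sched_switch": 2}
name_dim, arg_dim = 8, 4
name_embedding = rng.normal(size=(len(vocab), name_dim))

# Hypothetical linear encoder for three numeric arguments
# (e.g. timestamp delta, process id, return value).
W_args = rng.normal(size=(arg_dim, 3))

def encode_event(name, args):
    """Concatenate the name embedding with an encoding of the arguments."""
    e_name = name_embedding[vocab[name]]             # (name_dim,)
    e_args = W_args @ np.asarray(args, dtype=float)  # (arg_dim,)
    return np.concatenate([e_name, e_args])          # (name_dim + arg_dim,)

vec = encode_event("sys_read", [0.5, 1234, 0])
print(vec.shape)  # (12,)
```

The resulting vectors can then be fed to any sequence model, which is why the approach is described as model- and task-agnostic.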

Depgraph: Localizing Performance Bottlenecks in Multi-Core Applications Using Waiting Dependency Graphs and Software Tracing

Naser Ezzati-Jivan, Quentin Fournier, Michel R. Dagenais, and Abdelwahab Hamou-Lhadj


This paper addresses the challenge of understanding the waiting dependencies between the threads and hardware resources required to complete a task. The objective is to improve software performance by detecting the underlying bottlenecks caused by system-level blocking dependencies. In this paper, we use a system-level tracing approach to extract a Waiting Dependency Graph that shows the breakdown of a task execution among all the interleaving threads and resources. The method allows developers and system administrators to quickly discover how the total execution time is divided among its interacting threads and resources. Ultimately, the method helps detect bottlenecks and highlight their possible causes. Our experiments show the effectiveness of the proposed approach in several industry-level use cases. Three performance anomalies are analysed and explained using the proposed approach. Evaluating the method's efficiency reveals that the imposed overhead never exceeds 10.1%, therefore making it suitable for in-production environments.
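The breakdown can be pictured as a weighted graph whose edges carry waiting durations. A toy sketch in plain Python; the threads, resources, and durations are made up for illustration and are not taken from the paper:

```python
from collections import defaultdict

# Hypothetical wait edges extracted from a trace:
# (waiter, waited-on entity, duration in ms).
wait_edges = [
    ("task", "thread_A", 12.0),
    ("task", "thread_B", 3.0),
    ("thread_A", "disk", 9.0),
    ("thread_A", "mutex_1", 2.5),
    ("thread_B", "network", 3.0),
]

# Waiting dependency graph: node -> {dependency: total wait time}.
graph = defaultdict(lambda: defaultdict(float))
for waiter, dep, ms in wait_edges:
    graph[waiter][dep] += ms

def dominant_dependency(node):
    """Return the dependency this node spends the most time waiting on."""
    deps = graph[node]
    return max(deps, key=deps.get) if deps else None

print(dominant_dependency("task"))      # thread_A
print(dominant_dependency("thread_A"))  # disk: the likely bottleneck
```

Following the dominant edge from the task downwards points at the resource most responsible for the total latency, which is the kind of diagnosis the graph enables.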

Automatic Cause Detection of Performance Problems in Web Applications

Quentin Fournier, Naser Ezzati-Jivan, Daniel Aloise, and Michel R. Dagenais

View on GitHub Open Notebook arXiv

The execution of similar units can be compared by their internal behaviors to determine the causes of their potential performance issues. For instance, by examining the internal behaviors of different fast or slow web requests more closely and by clustering and comparing their internal executions, one can determine what causes some requests to run slowly or behave in unexpected ways. In this paper, we propose a method of extracting the internal behavior of web requests as well as introduce a pipeline that detects performance issues in web requests and provides insights into their root causes. First, low-level and fine-grained information regarding each request is gathered by tracing both the user space and the kernel space. Second, further information is extracted and fed into an outlier detector. Finally, these outliers are then clustered by their behavior, and each group is analyzed separately. Experiments revealed that this pipeline is indeed able to detect slow web requests and provide additional insights into their true root causes. Notably, we were able to identify a real PHP cache contention using the proposed approach.
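The last two stages of the pipeline, outlier detection followed by clustering of the outliers, can be sketched with scikit-learn. The per-request features below are synthetic stand-ins (e.g. duration and syscall count), not the trace features used in the paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical per-request features: mostly normal requests plus two
# distinct kinds of slow requests (e.g. I/O-bound vs. lock contention).
normal = rng.normal(loc=[10.0, 50.0], scale=1.0, size=(200, 2))
slow_io = rng.normal(loc=[40.0, 300.0], scale=1.0, size=(10, 2))
slow_lock = rng.normal(loc=[40.0, 60.0], scale=1.0, size=(10, 2))
X = np.vstack([normal, slow_io, slow_lock])

# Stage 2: flag outlying requests.
is_outlier = IsolationForest(contamination=0.1, random_state=0).fit_predict(X) == -1
outliers = X[is_outlier]

# Stage 3: cluster the outliers so each behavior group can be analyzed separately.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(outliers)
print(len(outliers), np.bincount(labels))
```

Each resulting cluster corresponds to one kind of anomalous behavior, which is what lets the analysis attribute a distinct root cause per group.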

Empirical Comparison Between Autoencoders and Traditional Dimensionality Reduction Methods

Quentin Fournier and Daniel Aloise

View on GitHub arXiv

To efficiently process ever higher-dimensional data such as images, sentences, or audio recordings, one needs a proper way to reduce the dimensionality of such data. In this regard, SVD-based methods including PCA and Isomap have been extensively used. Recently, a neural network alternative called the autoencoder has been proposed and is often preferred for its higher flexibility. This work aims to show that PCA is still a relevant technique for dimensionality reduction in the context of classification. To this end, we evaluated the performance of PCA compared to Isomap, a deep autoencoder, and a variational autoencoder. Experiments were conducted on three commonly used image datasets: MNIST, Fashion-MNIST, and CIFAR-10. The four dimensionality reduction techniques were separately employed on each dataset to project data into a low-dimensional space. Then a k-NN classifier was trained on each projection with a cross-validated random search over the number of neighbours. Interestingly, our experiments revealed that k-NN achieved comparable accuracy on the PCA and autoencoder projections, provided the projection dimension was large enough. However, PCA's computation time was two orders of magnitude faster than that of its neural network counterparts.
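The evaluation protocol, project first and then train a k-NN classifier on the projection, can be sketched with scikit-learn. For brevity this uses the small built-in digits dataset instead of MNIST and a fixed k rather than the cross-validated random search of the paper:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)  # 8x8 digit images, 64 dimensions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Project into a low-dimensional space with PCA (fit on the training set only).
pca = PCA(n_components=16).fit(X_train)

# Train a k-NN classifier on the projection and evaluate it on held-out data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(pca.transform(X_train), y_train)
acc = knn.score(pca.transform(X_test), y_test)
print(acc)  # high accuracy despite reducing 64 dimensions to 16
```

Swapping the PCA step for an autoencoder's encoder reproduces the paper's comparison, at a much higher training cost.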

Variational Autoencoder as a Justification for Naive Bayes, Linear and Quadratic Discriminant Analysis

Quentin Fournier and Charafeddine Talal

View on GitHub Open Notebook PDF

Classical machine learning methods such as naive Bayes, linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA) have been applied with success to many different problems. However, all the above methods make assumptions about the data distribution. Although those methods tend to work well even when their assumptions are not met, one could look for a way to systematically justify their use. As part of our project, we propose to learn a projection of the data that will satisfy all the assumptions made by naive Bayes, LDA, and QDA. In order to do so, we used an unsupervised probabilistic neural network called a variational autoencoder. Such a model can learn a projection which tends to follow a normal distribution N(0,I). This allows us to evaluate the impact of violating - or respecting - the assumptions made by the three classifiers. When applied to a real dataset of credit card fraud detection, we observed a significant improvement for QDA and naive Bayes. More specifically, for a small trade-off in precision, the recall rate of both methods increases sevenfold. However, LDA performs only slightly better on the learned projection than on the original space.

This project has been conducted as part of an assignment for MTH6312: Méthodes statistiques d'apprentissage at Polytechnique Montréal (Winter, 2018).

Neural Networks as an Alternative to I-Vectors for Speaker Verification

Quentin Fournier and Christian Raymond


This project has been conducted as part of a research internship at IRISA (Summer, 2017).

During this internship at IRISA, I investigated the use of neural networks for speaker verification. In particular, I studied data projections through deep encoders as an alternative to i-vector embeddings.