Scientific Research

Research Activities

My research work was conducted at the L3i laboratory (in the IC team) of the Computer Science Department at La Rochelle University between December 2019 and December 2022. It primarily falls within the field of analysis and understanding of administrative documents. These are paper or digital documents produced by large public or private institutions, and they encompass highly heterogeneous content, both structured and unstructured. This content takes many forms, such as charts in technical reports, diagrams in scientific articles, and graphical designs in bulletins. Humans can efficiently process the visual and textual information contained in these documents in order to make decisions on topics of interest such as science, industry, and health. However, manually understanding and analyzing large volumes of document data is time-consuming and costly.

Document data is generally presented in complex layouts, since each document is organized in its own way. Unlike natural-scene images, documents are particularly challenging because of their structural visual properties and heterogeneous textual content. Under these conditions, developing computer tools capable of automatically understanding and extracting precise, structured information from a wide variety of documents remains crucial, and such tools enable significant administrative and commercial applications.

Today, several applications rely on automatically understanding the data contained in administrative and commercial documents, such as document classification, content-based document retrieval, and few-shot document classification. The key to automated document understanding lies in efficiently integrating signals from multiple data modalities. Since documents are inherently multimodal, it is important to leverage information from both language and vision. Unlike other data formats, such as plain images or raw text extracted through optical character recognition (OCR), documents combine visual and linguistic information, complemented by the document layout. In addition, from a practical standpoint, many document understanding tasks are data-scarce. A model that can learn from unlabeled documents (i.e., pre-training) and then be fine-tuned for specific document applications is preferable to one that requires fully annotated training data (i.e., fully supervised learning).

My current research focuses on understanding and analyzing images of administrative documents (emails, invoices, advertisements, articles, reports, etc.), a domain widely exploited in document image processing applications. It concentrates on inter-modal interactions between the visual and textual information in document images, with the aim of designing an efficient learning framework. Designing such systems involves studying the benefits of inter-modal interactions in multi-modal learning: the systems encourage inter-modal learning between the features of the vision and language modalities in order to improve their distribution in a common representation space. The models developed result from an iterative process of analysis and synthesis between existing theories and the studies we conducted. In essence, my current research studies inter-modal learning for the contextual understanding of document components through language and vision. The main idea is to leverage the multi-modal information in document images within a common semantic space: automatically extract information from the content stored in information systems (document scans, structured and unstructured information), understand the interactions between visual and textual data, reorganize the search space, and ultimately find a common semantic space in which the target applications can be performed.
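
To make the idea of a common representation space concrete, here is a minimal sketch assuming a PyTorch-style setup: pre-computed visual and textual features of the same document page are projected into a shared, L2-normalized embedding space where cross-modal similarities can be compared. The class name, feature dimensions, and layer sizes are hypothetical illustrations, not the exact architecture used in my work.

```python
# Minimal sketch (hypothetical names and dimensions): project visual and textual
# features of a document page into a common semantic space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a modality-specific feature vector into the shared space."""
    def __init__(self, in_dim: int, shared_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, shared_dim),
            nn.ReLU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cross-modal similarity reduces to a dot product.
        return F.normalize(self.net(x), dim=-1)

# Hypothetical feature dimensions: 2048-d visual features (e.g., from a CNN)
# and 768-d textual features (e.g., from an encoder over the OCR text).
vision_proj = ProjectionHead(in_dim=2048)
text_proj = ProjectionHead(in_dim=768)

visual_feats = torch.randn(4, 2048)   # a batch of 4 document images
textual_feats = torch.randn(4, 768)   # OCR-derived text features of the same pages

z_v = vision_proj(visual_feats)       # (4, 256) embeddings in the common space
z_t = text_proj(textual_feats)        # (4, 256) embeddings in the common space
similarity = z_v @ z_t.t()            # pairwise cross-modal cosine similarities
```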

Overall, my research project advances inter-modal learning research and contributes on four fronts: (i) proposing an inter-modal approach based on a deep bidirectional neural network that simultaneously learns textual content and visual information from scanned document images, with the goal of jointly exploiting language and vision in a common semantic representation space to automatically make predictions about multimodal documents (i.e., the topics they address); (ii) studying competitive strategies for inter-modal document classification, few-shot document classification, and content-based document retrieval; (iii) addressing data-related issues such as learning without annotations, by proposing a network that learns generic representations from a collection of unlabeled documents; and finally, (iv) leveraging the learned parameters when the data contains only a few examples.
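
As an illustration of how the two modalities can be aligned in the shared space during pre-training, the sketch below implements a generic symmetric InfoNCE-style contrastive loss, assuming L2-normalized embeddings such as those produced above. The function name and temperature value are hypothetical, and this is not necessarily the exact objective used in the publications listed below.

```python
# Illustrative symmetric contrastive loss: matching image/text pairs from the
# same document are pulled together, non-matching pairs in the batch are pushed apart.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(z_v: torch.Tensor,
                                 z_t: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    # z_v, z_t: (batch, dim) L2-normalized embeddings in the common space.
    logits = z_v @ z_t.t() / temperature              # pairwise similarities
    targets = torch.arange(z_v.size(0), device=z_v.device)
    loss_v2t = F.cross_entropy(logits, targets)       # image -> matching text
    loss_t2v = F.cross_entropy(logits.t(), targets)   # text -> matching image
    return 0.5 * (loss_v2t + loss_t2v)

# Example with hypothetical embeddings of the kind produced by the sketch above:
z_v = F.normalize(torch.randn(4, 256), dim=-1)
z_t = F.normalize(torch.randn(4, 256), dim=-1)
loss = cross_modal_contrastive_loss(z_v, z_t)
```

Minimizing such a loss is one way to encourage the common semantic space described above, since the two modality-specific embeddings of the same document end up close together while embeddings of different documents are separated.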

Scientific Publications

You can also find my articles on my Google Scholar profile.

[International Journals]

[J1] Bakkali, S., Biswas, S., Ming, Z., Coustaty, M., Rusiñol, M., Terrades, O. R., & Lladós, J. TransferDoc: A Self-Supervised Transferable Document Representation Learning Model Unifying Vision and Language. Available at SSRN 4545314. (SJR rank Q1, IF 8.518)

[J2] Bakkali, S., Ming, Z., Coustaty, M., Rusiñol, M., & Terrades, O. R. (2023). VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification. Pattern Recognition, 139, 109419. (SJR rank Q1, IF 8.518)

[J3] Bakkali, S., Ming, Z., Coustaty, M., & Rusiñol, M. (2021). EAML: ensemble self-attention-based mutual learning network for document image classification. International Journal on Document Analysis and Recognition (IJDAR), 24(3), 251-268. (SJR rank Q1, IF 3.870)

[International Conferences & Workshops]

[C1] Bakkali, S., Ming, Z., Coustaty, M., & Rusiñol, M. (2020). Visual and textual deep feature fusion for document image classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops (pp. 562-563).

[C2] Bakkali, S., Ming, Z., Coustaty, M., & Rusiñol, M. (2020, October). Cross-modal deep networks for document image classification. In 2020 IEEE International Conference on Image Processing (ICIP) (pp. 2556-2560). IEEE. (CORE rank B)

[C3] Bakkali, S., Luqman, M. M., Ming, Z., & Burie, J. C. (2019, September). Face detection in camera captured images of identity documents under challenging conditions. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) (Vol. 4, pp. 55-60). IEEE.

[Thesis]

Bakkali, S. (2022). Multimodal Document Understanding with Unified Vision and Language Cross-Modal Learning. Thesis, L3i Laboratory, La Rochelle Université.