Internship position

Title: Automatic labelling of webpages for historical research.

Keywords: Digital humanities, Machine learning, Complex networks

Duration: 6 months between March and August 2024.

Application deadline: March 1st, 2024.

Location: Centre de Physique Théorique, Marseille, France.

Topic: In collaboration with Sophie Gebeil, MCF and researcher at the TELEMMe Laboratory (CNRS UMR 7303), and Patrice Bellot, Professor at the LiS (Informatics & Systems Laboratory), we offer an interdisciplinary internship at the intersection of computer science, historical research and complex networks. Building on previous results, the aim is to set up a data processing pipeline that automatically labels elements of a corpus of web pages from institutional archives, so that researchers in history can explore them. The pipeline will consist of two analyses: identification of topics through topic modelling, and extraction of relevant entities through named entity recognition. Once labelled, the data will be further categorised using network methods to identify communities of topics, entities and documents.

The first goal is to ensure that the data processing method can be deployed on large corpora, and that its results are stable, reproducible and of interest for research purposes. The intern is expected to work with historians to adapt the process to their research questions and to help them navigate the documents. A secondary aspect of the internship may be setting up a search engine based on the results of the automatic labelling.

The test corpus is about the 1983 “Marche pour l’égalité et contre le racisme”. Application to other, larger corpora is possible.

Candidate profile:
  • Master's student in computer science or computational social science.
  • Very good proficiency in Python.
  • Knowledge of HTML, topic modelling and networks would be useful.
  • Interest in digital humanities and interdisciplinary work is a plus.

If you are interested, send a CV to:

mathieu.genois [at]