Online education is one of the wealthiest industries in the world. The relevance of this sector has increased due to the COVID-19 emergency, which forced nations to move their education systems to online environments quickly. Despite the benefits of distance learning, students enrolled in online degree programs have a higher chance of dropping out than those attending a conventional classroom environment. Being able to detect student withdrawals early is fundamental to building the next generation learning environment. In machine learning, this is known as the student dropout prediction (SDP) problem. In this tutorial, intermediate-level academicians, industry practitioners, and institutional officers will learn about existing work and current progress within this domain. We provide a mathematical formalisation of the SDP problem, and we review comprehensively the most useful aspects to consider for this domain: definition of the prediction problem, input modelling, adopted prediction technique, evaluation framework, standard benchmark datasets, and privacy concerns.
Online courses and e-degrees, although present since the mid-1990s, have received enormous attention only in the last decade. Moreover, the new Coronavirus disease (COVID-19) outbreak forced many nations (e.g. Italy, the US, and other countries) to push their education systems massively towards an online environment. Academics are now also looking at the crisis as an opportunity for universities to adopt digital technologies for teaching more broadly. However, they will first have to understand which ways of teaching and evaluating are effective in this new scenario.
This situation, together with the usefulness and ubiquitous accessibility of online course platforms, results in a vast number of enrolments. Nevertheless, a high enrolment rate usually translates into a significant dropout (or withdrawal) rate: 40-80% of online students drop out.
Student dropout prediction (SDP) consists of modelling and forecasting student behaviour when interacting with e-learning platforms. Dropout is a significant phenomenon with repercussions for online institutions and for the students and professors involved. Early approaches tended to rely on manual analytic examinations to devise retention strategies. Recent research has adopted automated approaches that exploit the record of student activities (hereafter e-tivities) on the e-platforms to identify at-risk students. These approaches include machine learning and deep learning techniques to predict the student dropout status.
With the growing popularity of Massive Open Online Courses (MOOCs) and online platforms - e.g. edX and Coursera - research has concentrated on time-series methods that take into account the faster pace of interactive and telematic courses. Being able to cope in real time with shifting trends in student interactions with the course platforms has therefore become of paramount importance. For example, by looking at the number of e-tivities that a student performs in several course phases, one can determine whether they will persist or drop out in the next stage.
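As a minimal sketch of the example above, the snippet below turns per-phase e-tivity counts into training pairs for next-phase prediction. It assumes that a student counts as a dropout in a phase when they perform no e-tivities in it; this is only one possible dropout definition, chosen here for illustration, and the function name and the sample counts are hypothetical.

```python
from typing import List, Tuple


def next_phase_examples(counts: List[int]) -> List[Tuple[List[int], int]]:
    """Build (history, label) pairs from one student's per-phase e-tivity counts.

    history: counts observed in phases 0..t-1
    label:   1 if the student performs no e-tivities in phase t
             (assumed dropout definition), else 0
    """
    examples = []
    for t in range(1, len(counts)):
        history = counts[:t]
        label = 1 if counts[t] == 0 else 0
        examples.append((history, label))
    return examples


# Hypothetical student: activity decays over three phases, then stops.
student_counts = [12, 7, 3, 0]
pairs = next_phase_examples(student_counts)
for history, label in pairs:
    print(history, "->", "dropout" if label else "persist")
```

Any classifier that accepts variable-length histories (or histories padded to a fixed length) could then be trained on such pairs.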
In this tutorial, we give a comprehensive overview of the SDP problem in the literature. We provide a mathematical formalisation of the different definitions proposed, and we introduce both simple and complex predictive methods.
This tutorial presents a comprehensive overview of student dropout prediction from online courses. In detail, this tutorial is divided into the following sections:
Theory and SDP definitions:
Depending on the chosen dropout definitions mentioned above, we can exploit either a plain input modelling technique or sequence labelling (i.e. time-series and network-based representations). The two approaches tackle the SDP problem in different ways: the first copes with static data, while the second handles time-dependent information (i.e. e-tivities). We will cover both, but this tutorial focuses on time-relevant features.
Plain modelling: Static features such as student background information and previous study-related data (e.g. admission test scores, high school GPA) provide a robust input for the prediction strategies.
Sequence labelling: Sequence labelling consists of shaping the raw data logs into discrete temporal series that capture student e-tivities. Temporal series and networks are two input modelling techniques for handling sequential data, as is the case with e-tivities.
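The two input forms described above can be sketched side by side. The snippet below is an illustrative sketch only: the record fields, the log format (student id, ISO timestamp, e-tivity type), the weekly granularity and all sample values are assumptions, not taken from any specific platform or dataset.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical static records (plain modelling); field names are illustrative.
STATIC: Dict[str, Dict[str, float]] = {
    "s1": {"admission_score": 62.0, "high_school_gpa": 3.1},
    "s2": {"admission_score": 88.0, "high_school_gpa": 3.9},
}
FIELDS = ["admission_score", "high_school_gpa"]

# Hypothetical raw log (sequence labelling): (student, timestamp, e-tivity).
LOGS: List[Tuple[str, str, str]] = [
    ("s1", "2020-03-02T09:00:00", "video_view"),
    ("s1", "2020-03-04T18:30:00", "quiz_attempt"),
    ("s1", "2020-03-10T10:15:00", "forum_post"),
    ("s2", "2020-03-03T11:00:00", "video_view"),
    ("s2", "2020-03-20T08:45:00", "video_view"),
]


def static_vector(record: Dict[str, float], fields: List[str]) -> List[float]:
    """Plain modelling: one fixed-length, time-independent vector per student."""
    return [record[f] for f in fields]


def weekly_series(
    logs: List[Tuple[str, str, str]], course_start: datetime, n_weeks: int
) -> Dict[str, List[int]]:
    """Sequence labelling input: count e-tivities per student per course week."""
    series: Dict[str, List[int]] = {}
    for student, timestamp, _etivity in logs:
        counts = series.setdefault(student, [0] * n_weeks)
        week = (datetime.fromisoformat(timestamp) - course_start).days // 7
        if 0 <= week < n_weeks:  # ignore events outside the course window
            counts[week] += 1
    return series


course_start = datetime(2020, 3, 2)
series = weekly_series(LOGS, course_start, n_weeks=3)
for student in STATIC:
    print(student, static_vector(STATIC[student], FIELDS), series[student])
```

The static vector stays the same throughout the course, whereas the weekly series grows as the course progresses, which is what makes it suitable for time-dependent prediction.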
The various prediction strategies utilised for solving the SDP problem are classified into two major categories: shallow (classical) machine learning and deep learning approaches. The selection of the prediction technique is strictly related to the adopted modelling methodology. In other words, the literature divides into models based on sequence labelling and models that rely on plain modelling.
Machine Learning Methods:
Deep Learning Methods:
We list the different evaluation measures used in the literature while acknowledging the unbalanced ratio of persisters to dropouts. Next, we discuss the scarcity of standard benchmarking datasets, which leads to a lack of comparisons among state-of-the-art methodologies. We give details on how researchers can publish their data while complying with privacy regulations such as the GDPR in the EU. Finally, we present open issues that are important but have not been well addressed in recent studies and that can inspire future directions in this research field. Here we cover generalisation issues and the move from a single online course (or MOOC) to several interrelated courses forming an online degree.
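To illustrate why the class imbalance noted above matters for evaluation, the following sketch compares plain accuracy with some commonly used imbalance-aware measures. The choice of measures (precision, recall, F1 and balanced accuracy) is an assumption made for illustration, and the labels are made-up toy data in which 1 denotes a dropout and 0 a persister.

```python
from typing import Dict, List


def imbalance_aware_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute metrics with 1 (dropout) treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of failing when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)       # true positive rate
    specificity = ratio(tn, tn + fp)  # true negative rate
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Toy data: 8 dropouts, 2 persisters; a trivial model predicts "dropout" for all.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
metrics = imbalance_aware_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this toy data the trivial predictor reaches an accuracy of 0.8 but a balanced accuracy of only 0.5, which shows how accuracy alone can be misleading when one class dominates.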
The Information and Knowledge Management community - historically interested in Analytics and Machine Learning, and in Neural Information and Knowledge Processing - will benefit from this tutorial, since interest in the e-learning area has risen over the last decade. The need for knowledge on the topics discussed in the tutorial will grow further with the rising demand for (and implementation of) new systems dedicated to distance learning. In this scenario, successfully detecting student dropout will be a fundamental task of the next generation distance learning environment, enabling improvements in sustainability, transparency and fairness from both the institutions' and the students' perspectives. For these reasons, we think that the proposed tutorial is an optimal fit for the central theme - Data and knowledge for the next generation: sustainability, transparency and fairness - of CIKM 2020.
The tutorial is aimed at intermediate-level academicians, industry practitioners and institutional officers who want to build or extend their knowledge in the area of Educational Data Mining. More specifically, the tutorial is dedicated to those who have some familiarity with machine learning and the basics of time-series modelling techniques. However, we will not assume that the audience has specific knowledge of either the SDP or the Learning Analytics field. Hence, we will provide sufficient detail to make the content understandable, including the complex concepts behind the presented deep learning strategies.
At the end of this tutorial, the participants will be able to understand and implement best practices in student dropout prediction. In more detail, the participants will take away the following lessons according to their background:
© Copyright 2020 AIIM-RG - All rights reserved.