Challenges and Solutions to the Student Dropout Prediction Problem in Online Courses

Tutorial at the ACM Conference on Information & Knowledge Management (CIKM), Galway, October 19th-20th, 2020


Online education is one of the wealthiest industries in the world. The relevance of this sector has increased due to the COVID-19 emergency, which forced nations to move their education systems to online environments quickly. Despite the benefits of distance learning, students enrolled in online degree programs have a higher chance of dropping out than those attending a conventional classroom environment. Being able to detect student withdrawals early is fundamental to building the next generation learning environment. In machine learning, this is known as the student dropout prediction (SDP) problem. In this tutorial, intermediate-level academicians, industry practitioners, and institutional officers will learn about existing work and current progress within this domain. We provide a mathematical formalisation of the SDP problem, and we review the most useful aspects to consider for this specific domain: definition of the prediction problem, input modelling, adopted prediction technique, evaluation framework, standard benchmark datasets, and privacy concerns.

Motivation and Overview

Online courses and e-degrees, although present since the mid-1990s (link), have received enormous attention only in the last decade. Moreover, the outbreak of the new coronavirus disease (COVID-19) forced many nations (e.g. Italy and the US) to massively push their education systems towards an online environment. Academics now also see the crisis as an opportunity for universities to adopt digital technologies for teaching more broadly. However, they will have to understand which ways of teaching and evaluating are effective in this new scenario.

This scenario, together with the utility of online courses and the ubiquitous access to their educational platforms, entails a vast number of enrolments. Nevertheless, a high enrolment rate usually translates into a significant dropout (or withdrawal) rate: 40-80% of online students drop out [27].

Student dropout prediction (SDP) consists of modelling and forecasting student behaviour when interacting with e-learning platforms. Dropout is a significant phenomenon that has repercussions on online institutions, students, and professors. Early approaches tended to perform manual analytic examinations to devise retention strategies. Recent research has adopted automated approaches that exploit student activities (hereafter e-tivities) on the e-platforms to identify at-risk students. These approaches include machine learning and deep learning techniques to predict the student dropout status.

With the growing popularity of Massive Open Online Courses (MOOCs) and online platforms (e.g. edX and Coursera), research has concentrated on time-series methods that take into account the faster pace of interactive and telematic courses. Therefore, being able to cope in real time with shifting trends in student interactions with the course platforms has become of paramount importance. For example, by looking at the number of e-tivities that a student performs in several course phases, one can predict whether they will persist or drop out in the next phase [11].

In this tutorial, we give a comprehensive overview of the SDP problem in the literature. We provide a mathematical formalisation of the different definitions proposed, and we introduce simple [1],[2],[3],[6],[8],[14],[15],[16],[18],[19],[21],[22],[23],[24],[28] and complex [9],[11],[12],[26],[30] predictive methods, organised along the following aspects:

  • Student dropout definition: The type of dropout definition (one plain and several recurrent definitions exist) determines the most suitable input modelling strategies and, as a consequence, the adopted prediction methods.
  • Input modelling: Most works focus on predicting the final dropout decision of students. Two main approaches can be adopted. The first is a plain modelisation with time-independent features, and the second consists of a phasic view (i.e. weekly in MOOCs, and module-based in online courses) of e-tivities.
  • Underlying machine and deep learning techniques: Past research lines focused on using off-the-shelf machine learning models to predict dropouts, e.g. Naive Bayes, Logistic Regression, Decision Tree, SVM. Recently, deep learning approaches (e.g. RNNs, CNNs and their combination) have gained the attention of the research community.
  • Evaluation measures: Due to the imbalance between the numbers of persisting and dropout students, choosing the appropriate evaluation metric becomes a fundamental issue in SDP. We will present the utility of each measure (e.g. F-score, precision, recall, accuracy, AUCROC, and AUCPR) and how the literature uses them to report performance.
  • Datasets: Although few in number, standard benchmark datasets are useful when comparing the performance of the different SDP methods. Here, we will present all the available datasets in the literature that can be useful in the evaluation and comparison phase.
  • Privacy concerns: Nowadays, a fundamental organisational aspect in the Educational Data Mining (EDM) domain is to comply effectively with privacy regulations, especially in the EU (link), and with institutional policies.

Tutorial Outline

This tutorial presents a comprehensive overview of student dropout prediction from online courses. In detail, this tutorial is divided into the following sections:

Introduction and background to the theory of the SDP problem


  • Discussion about the specific problem of premature student dropout from courses in online institutions. We will illustrate the interaction between the principal stakeholders, such as students, teachers, and tutors, via actions performed on learning platforms.
  • We will focus on student e-tivities within an e-learning platform. We will also show the impact that SDP has had on the Educational Data Mining (EDM) research area in the last decade.

Theory and SDP definitions:

  • We will provide a conceptual notation for courses, course phases, and students as sets of e-tivities, taking into account the time dimension.
  • The literature treats SDP as a binary classification problem, in either a plain or a recurrent form. We will provide a plain definition for SDP and rely on the three recurrent definitions presented in [11] to guide the input modelisation and the choice of the prediction model.
  • Discussion about the advantages and disadvantages of the three dropout definitions, with a focus on pre-emptive dropout identification.
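
To make the notation concrete, the following Python sketch represents a student as a series of e-tivity counts per course phase and derives two kinds of dropout labels from it. It is only an illustration under stated assumptions: the event format, the function names, and both labelling rules (inactive in the final phase for the plain label, inactive in the next phase for the recurrent label) are simplified readings chosen for this example, not necessarily the exact definitions given in [11].

```python
from collections import Counter


def phase_counts(events, student, num_phases):
    """Count the e-tivities of one student in each course phase.

    `events` is a hypothetical log format: a list of (student_id, phase) pairs.
    """
    counts = Counter(phase for s, phase in events if s == student)
    return [counts.get(p, 0) for p in range(num_phases)]


def plain_label(series):
    """Assumed plain definition: dropout (1) if inactive in the final phase."""
    return 1 if series[-1] == 0 else 0


def next_phase_labels(series):
    """Assumed recurrent definition: at phase t, 1 if no e-tivity occurs in t + 1."""
    return [1 if series[t + 1] == 0 else 0 for t in range(len(series) - 1)]


# Toy log: s1 stops after the second phase, s2 stays active throughout.
events = [("s1", 0), ("s1", 0), ("s1", 1),
          ("s2", 0), ("s2", 1), ("s2", 2), ("s2", 3)]

series_s1 = phase_counts(events, "s1", num_phases=4)
series_s2 = phase_counts(events, "s2", num_phases=4)

print(series_s1, plain_label(series_s1), next_phase_labels(series_s1))
print(series_s2, plain_label(series_s2), next_phase_labels(series_s2))
```

The plain label yields one target per student, whereas the recurrent label yields one target per phase, which is what makes early identification of at-risk students possible.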

Input modelling techniques

Depending on the chosen dropout definition, we can exploit either a plain input modelling technique or sequence labelling (i.e. time-series [4] and network-based modelling inspired by [5],[31]). The two modelling approaches tackle the SDP problem in different ways: the first copes with static data, while the second handles time-dependent information (i.e. e-tivities). We will cover both, but this tutorial focuses on time-relevant features.

Plain modelisation: Static features such as student background information and previous study-related data (e.g. admission test scores, high school GPA) offer a robust input mechanism for the prediction strategies.

  • Illustration of the conceptual workflow of the transformation of raw data - usually student logs collected by the online institutions - into a plain modelisation format. We will discuss some techniques to flatten time-dependent e-tivity features into vectors of study-related characteristics.
  • Description of the adoption of background and study-related features in the literature.
  • Ad-hoc features are suitable when contextual information is provided [15]. MOOC datasets usually do not publish such information because of privacy concerns.
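
As a rough illustration of flattening, the sketch below collapses a per-phase series of e-tivity counts into a handful of time-independent aggregates and merges them with static background features. The feature names (`total_etivities`, `active_phases`, and so on) and the `admission_score` field are hypothetical choices made for this example; the literature uses many other aggregations.

```python
def flatten(series, static):
    """Collapse a per-phase e-tivity series into one plain feature vector.

    `series` holds e-tivity counts per course phase; `static` holds
    background features. All feature names are illustrative assumptions.
    """
    active = [phase for phase, count in enumerate(series) if count > 0]
    return {
        **static,
        "total_etivities": sum(series),
        "active_phases": len(active),
        "mean_per_phase": sum(series) / len(series),
        "last_active_phase": active[-1] if active else -1,
    }


features = flatten([2, 1, 0, 0], {"admission_score": 71.5})
print(features)
```

The flattened vector can be fed to any off-the-shelf classifier, at the cost of discarding the order in which the e-tivities occurred.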

Sequence labelling: Sequence labelling consists of shaping the raw data logs into discrete temporal series of student e-tivities. Building on the theory of discrete-time series [4], temporal series and temporal networks are two input modelisation techniques for handling sequential data such as e-tivities.

  • Introduction to the clickstream-based [9],[16],[18],[21],[24],[26],[28],[30] and forum intervention-based [11],[26] schemas of sequence labelling with temporal series. We emphasise the peculiarities of time series when gaps between course phases are taken into consideration, which make sequence labelling more problematic.
  • The underrepresented temporal network approach [14] has been used only as a preliminary investigation for other off-the-shelf machine learning models. The temporal and hierarchical structure of threads, comments, and replies provides graph-related metrics for prediction. We will give two useful techniques to construct networks, based on two different assumptions. The first states that every new student post in a particular thread θ responds to every other message in θ [5]. The second supposes that every comment and reply of a student is connected with the thread initiator, thus forming star networks [31].
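
The two network construction assumptions can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of [5] or [31]: a thread is assumed to be a list of author names in posting order, "every other message" is read here as every earlier message in the thread, and self-replies are dropped.

```python
def reply_to_all_edges(thread):
    """First assumption: each new post replies to every earlier post in the thread.

    `thread` is a hypothetical format: author names in posting order.
    Returns directed (replier, replied_to) edges without self-loops.
    """
    edges = set()
    for i, author in enumerate(thread):
        for earlier in thread[:i]:
            if earlier != author:
                edges.add((author, earlier))
    return edges


def star_edges(thread):
    """Second assumption: every comment or reply points to the thread initiator."""
    initiator = thread[0]
    return {(author, initiator) for author in thread[1:] if author != initiator}


def in_degree(edges):
    """A simple graph metric: how many distinct students replied to each student."""
    degree = {}
    for _, target in edges:
        degree[target] = degree.get(target, 0) + 1
    return degree


thread = ["ann", "bob", "cat", "bob"]
dense = reply_to_all_edges(thread)
star = star_edges(thread)
print(sorted(dense), in_degree(dense))
print(sorted(star), in_degree(star))
```

The first construction produces denser graphs in long threads, while the second concentrates all edges on the initiator, so the same forum log can lead to quite different graph metrics.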

Methods in SDP

The various prediction strategies used to solve the SDP problem are classified into two major categories: shallow machine learning and deep learning approaches. The selection of the prediction technique is strictly related to the adopted modelisation methodology. In the literature, the models based on sequence labelling are [6],[9],[11],[16],[18],[21],[22],[24],[26],[28],[30], whereas those that rely on plain modelisation are [1],[2],[3],[12],[13],[15],[19],[23].

Machine Learning Methods:

  • Presentation of methods based on logistic regression [14],[18],[28], SVM [2],[21], Decision Trees [1],[8],[24], and Naive Bayes classifiers [13],[22].
  • Ensemble methods combine several weak classifiers and make predictions according to a particular consensus function (e.g. majority voting) to obtain an overall robust classifier. Ensemble methods in the SDP literature include random forests [15],[16], AdaBoost [3],[19] and custom types of ensembles [23].
  • Single-hidden-layer neural networks (NNs) have been used in an approach relying on the extreme learning machine (ELM). Transforming a decision tree classifier and passing it through enhancement and mapping layers produces a sparsely connected single-hidden-layer NN that can be used for prediction [6].
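
The consensus idea behind ensembles can be illustrated with plain majority voting over a few hand-written weak rules. This sketch is not AdaBoost (which weights its weak learners) nor any of the cited ensembles; the three rules, their thresholds, and the feature names are hypothetical and chosen only for the example.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Return the label predicted by most classifiers (1 = dropout, 0 = persist)."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Three hypothetical weak rules; an odd number avoids ties on binary labels.
weak_rules = [
    lambda x: 1 if x["total_etivities"] < 5 else 0,
    lambda x: 1 if x["last_active_phase"] < 2 else 0,
    lambda x: 1 if x["forum_posts"] == 0 else 0,
]

quiet_student = {"total_etivities": 3, "last_active_phase": 3, "forum_posts": 0}
busy_student = {"total_etivities": 20, "last_active_phase": 3, "forum_posts": 0}

print(majority_vote(weak_rules, quiet_student))
print(majority_vote(weak_rules, busy_student))
```

Each rule alone is easy to fool, but a student must trigger at least two of them to be flagged, which is the robustness argument usually made for ensembles.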

Deep Learning Methods

  • Past student behaviour in terms of e-tivity engagement influences current behaviour. Therefore, classic RNNs and recurrent networks with LSTM cells [11] have been adopted in various ways. LSTM-based autoencoders [9] have been used to construct a hidden representation of the input, which is then fed into a feed-forward dense layer that predicts the outcome.
  • The use of attention mechanisms on contextual data, feature augmentation and CNNs [12] has been shown to improve over machine learning baselines as well as deep feed-forward neural networks. We also explore combinations of CNNs without pooling and several dense layers, which preserve all the feature information [26].
  • Combining CNN and RNN layers into a single end-to-end network improves performance [30].
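
To show how a recurrent model lets past behaviour influence the current prediction, here is a forward pass of a plain (Elman-style) recurrent cell over a series of e-tivity counts, written with the standard library only. It is a toy under strong assumptions: the weights are hand-picked rather than trained, the hidden state has two units, and no LSTM gating, attention, or convolution is included. Models such as those in [9],[11],[30] are trained with deep learning frameworks.

```python
import math


def rnn_dropout_probability(series, w_in, w_rec, bias, w_out, b_out):
    """Run a simple recurrent cell over the series and return a dropout probability.

    hidden_t = tanh(w_in * x_t + w_rec @ hidden_{t-1} + bias)
    probability = sigmoid(w_out . hidden_T + b_out)
    """
    size = len(bias)
    hidden = [0.0] * size
    for x in series:
        hidden = [
            math.tanh(
                w_in[j] * x
                + sum(w_rec[j][k] * hidden[k] for k in range(size))
                + bias[j]
            )
            for j in range(size)
        ]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


# Hand-picked illustrative weights (assumptions, not learned parameters).
params = dict(
    w_in=[0.5, -0.5],
    w_rec=[[0.1, 0.0], [0.0, 0.1]],
    bias=[0.0, 0.0],
    w_out=[-2.0, 2.0],
    b_out=0.0,
)

p_fading = rnn_dropout_probability([3, 1, 0, 0], **params)
p_steady = rnn_dropout_probability([3, 3, 3, 3], **params)
print(p_fading, p_steady)
```

With these weights, the student whose activity fades receives a higher dropout probability than the steadily active one, because the hidden state carries a summary of the earlier phases into the final prediction.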

Evaluation measures, dataset publication and open challenges

We list the different evaluation measures used in the literature while acknowledging the unbalanced ratio of persisters to dropouts. Next, we discuss the scarcity of standard benchmarking datasets, which leads to a lack of comparisons among the state-of-the-art methodologies. We give details on how researchers can publish their data in compliance with privacy regulations such as the GDPR in the EU. Finally, we present the open issues that are important but have not been well addressed in recent studies, and which can inspire future directions in this research field. Here we cover generalisation issues and the passage from a single online course (or MOOC) to several interrelated courses forming an online degree.
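
The effect of class imbalance on the choice of measure can be seen in a small Python example. The data are invented for illustration, and label 1 is assumed to denote the minority class. A trivial predictor that always outputs the majority class reaches the same accuracy as a predictor that actually finds some minority cases, and only precision, recall and F-score tell them apart.

```python
def scores(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented toy data: 2 minority-class students out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10
imperfect_model = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

trivial = scores(y_true, always_majority)
useful = scores(y_true, imperfect_model)
print(trivial)
print(useful)
```

Both predictors score 0.8 in accuracy, but the trivial one has an F1 of 0 while the imperfect model reaches 0.5, which is why accuracy alone is a poor guide on unbalanced SDP data.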

Relevance, Audience, and Benefits

The Information and Knowledge Management community, historically interested in Analytics and Machine Learning and in Neural Information and Knowledge Processing, will benefit from this tutorial, since interest in the e-learning area has risen over the last decade [7],[29]. The need for knowledge on the topics discussed in the tutorial will grow further with the rising demand for (and implementation of) new systems dedicated to distance learning [10]. In this scenario, successfully detecting student dropout will be a fundamental task of the next generation distance learning environment, enabling improvements in sustainability, transparency and fairness from both the institutions' and the students' perspectives. For these reasons, we think that the proposed tutorial is an optimal fit with the central theme of CIKM 2020, "Data and knowledge for the next generation: sustainability, transparency and fairness".

The tutorial is tailored to intermediate-level academicians, industry practitioners and institutional officers who want to build or increase their knowledge in the area of Educational Data Mining. More specifically, the tutorial is dedicated to those who have some familiarity with machine learning and the basics of time-series modelisation techniques. However, we will not assume that the audience has specific knowledge of the SDP or Learning Analytics [20] fields. Hence, we will provide sufficient detail for the content to be understandable, including the complex concepts of the presented deep learning strategies.

At the end of this tutorial, the participants will be able to understand and implement the best practices in student dropout prediction. In more detail, the participants will take away the following lessons according to their background:

  • Student in IKM: You will learn how the SDP problem is formulated and how to model it according to time-series and plain modelisation. You will discover the most promising approaches in machine learning and deep learning used for prediction. We will provide you with a thorough explanation of the diversity of evaluation metrics used in the literature and comment on their usefulness. You will learn which datasets are available in the literature and how they can be useful for evaluation and comparison.
  • Academics: You will obtain details on the formalisation of the dropout problem, which provides the building blocks for further research. You will learn about the intrinsic difficulties that temporal series and temporal gaps pose when dealing with dropout prediction via sequence labelling. You will understand the importance of choosing the most suitable evaluation measure when coping with problems such as unbalanced data, feature sparsity, and temporal gaps. Finally, we will provide you with a simple yet sufficient privacy-compliant schema that obfuscates data if you want to make your dataset publicly available.
  • Machine Learning Engineer: You will see familiar off-the-shelf machine learning models that are built into most packages of modern programming languages. However, we will also introduce exciting ensemble models and sophisticated deep learning strategies (e.g. combinations of convolutional and recurrent neural networks, and the incorporation of attention mechanisms). Furthermore, we discuss the importance of time-series modelling according to the choice of the dropout definition. These concepts will leave you with a general grounding in time-series analytics and its inherent problems.
  • Education Manager: You will understand the key and critical aspects of SDP, the technologies involved and the expertise needed to implement this task in your organisation. The knowledge gained will then allow you to decide which approach your organisation needs to implement to tackle student dropout effectively. This broad overview will help you to understand the benefits and drawbacks of adopting one kind of e-learning platform or another. Thus, you will be able to plan and manage the resources of your institution more thoughtfully.


  1. Qasem A. Al-Radaideh, Emad M. Al-Shawakfa, and Mustafa I. Al-Najjar. 2006. Mining student data using decision trees. In International Arab Conference on Information Technology (ACIT’2006), Yarmouk University, Jordan. 1–5.
  2. Bussaba Amnueypornsakul, Suma Bhat, and Phakpoom Chinprutthiwong. 2014. Predicting attrition along the way: The UIUC model. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs. Association for Computational Linguistics, Doha, Qatar, 55–59.
  3. Johannes Berens, Kerstin Schneider, Simon Görtz, Simon Oster, and Julian Burghoff. 2018. Early Detection of Students at Risk–Predicting Student Dropouts Using Administrative Student Data and Machine Learning Methods. CESifo Working Paper, Munich, Germany.
  4. Peter J. Brockwell, Richard A. Davis, and Matthew V. Calder. 2002. Introduction to time series and forecasting. Vol. 2. Springer, New York, NY, USA.
  5. Rebecca Brown, Collin Lynch, Yuan Wang, Michael Eagle, Jennifer Albert, Tiffany Barnes, Ryan Shaun Baker, Yoav Bergner, and Danielle S. McNamara. 2015. Communities of Performance & Communities of Preference. In CEUR Workshop Proceedings, Vol. 1446. CEUR-WS, USA.
  6. Jing Chen, Jun Feng, Xia Sun, Nannan Wu, Zhengzheng Yang, and Sushing Chen. 2019. MOOC Dropout Prediction Using a Hybrid Algorithm Based on Decision Tree and Extreme Learning Machine. Mathematical Problems in Engineering 2019 (2019).
  7. KDD Cup. 2010. Educational Data Mining Challenge. (link)
  8. Gerben W. Dekker, Mykola Pechenizkiy, and Jan M. Vleeshouwer. 2009. Predicting Students Drop Out: A Case Study. International Working Group on Educational Data Mining (2009).
  9. Mucong Ding, Kai Yang, Dit-Yan Yeung, and Ting-Chuen Pong. 2018. Effective Feature Learning with Unsupervised Learning for Improving the Predictive Models in Massive Open Online Courses. (arXiv)
  10. Times Higher Education. 2020. Will the coronavirus make online education go viral? (link)
  11. Fei Mi and Dit-Yan Yeung. 2015. Temporal models for predicting student dropout in massive open online courses. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, 256–263.
  12. Wenzheng Feng, Jie Tang, and Tracy Xiao Liu. 2019. Understanding Dropouts in MOOCs. In AAAI 2019.
  13. Elena Gaudioso, Miguel Montero, and Felix Hernandez-Del-Olmo. 2012. Supporting teachers in adaptive educational systems through predictive models: A proof of concept. Expert Systems with Applications 39, 1 (2012), 621–625.
  14. Niki Gitinabard, Farzaneh Khoshnevisan, Collin F. Lynch, and Elle Yuan Wang. 2018. Your Actions or Your Associates? Predicting Certification and Dropout in MOOCs with Behavioral and Social Features. (arXiv)
  15. Cameron C. Gray and Dave Perkins. 2019. Utilizing early engagement and machine learning to predict student outcomes. Computers & Education 131 (2019), 22–32.
  16. Liu Haiyang, Zhihai Wang, Phillip Benachour, and Philip Tubman. 2018. A Time Series Classification Method for Behaviour-Based Dropout Prediction. In 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT). IEEE, 191–195.
  17. HarvardX. 2014. HarvardX Person-Course Academic Year 2013 De-Identified dataset, version 3.0. (link)
  18. Jiazhen He, James Bailey, Benjamin I.P. Rubinstein, and Rui Zhang. 2015. Identifying at-risk students in massive open online courses. In 29th AAAI Conference on Artificial Intelligence.
  19. Ya-Han Hu, Chia-Lun Lo, and Sheng-Pao Shih. 2014. Developing early warning systems to predict students’ online learning performance. Computers in Human Behavior 36 (2014), 469–478.
  20. Usha Keshavamurthy and H. S. Guruprasad. 2014. Learning Analytics: A Survey. International Journal of Computer Trends and Technology (IJCTT) 18(6) (2014).
  21. Marius Kloft, Felix Stiehler, Zhilin Zheng, and Niels Pinkwart. 2014. Predicting MOOC dropout over weeks using machine learning methods. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs. 60–65.
  22. Wentao Li, Min Gao, Hua Li, Qingyu Xiong, Junhao Wen, and Zhongfu Wu. 2016. Dropout prediction in MOOCs using behavior features and multi-view semi-supervised learning. In 2016 international joint conference on neural networks (IJCNN). IEEE, 3130–3137.
  23. Ioanna Lykourentzou, Ioannis Giannoukos, Vassilis Nikolopoulos, George Mpardis, and Vassili Loumos. 2009. Dropout prediction in e-learning courses through the combination of machine learning techniques. Computers & Education 53, 3 (2009), 950–965.
  24. Saurabh Nagrecha, John Z. Dillon, and Nitesh V. Chawla. 2017. MOOC dropout prediction: lessons learned from making pipelines interpretable. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 351–359.
  25. Bardh Prenkaj, Paola Velardi, Giovanni Stilo, Damiano Distante, and Stefano Faralli. 2020. A Survey of Machine Learning approaches for Student Dropout Prediction in Online Courses. ACM Computing Surveys 53, 3 (2020).
  26. Lin Qiu, Yanshen Liu, Quan Hu, and Yi Liu. 2018. Student dropout prediction in massive open online courses by convolutional neural networks. Soft Computing (2018), 1–15.
  27. Belinda G. Smith. 2010. E-learning technologies: A comparative study of adult learners enrolled on blended and online campuses engaging in a virtual classroom. Ph.D. Dissertation. Capella University.
  28. Colin Taylor, Kalyan Veeramachaneni, and Una-May O’Reilly. 2014. Likely to stop? predicting stopout in massive open online courses. (arXiv)
  29. Hanghang Tong, Zhenhui Jessie Li, Feida Zhu, and Jeffrey Yu (Eds.). 2018. 2018 IEEE International Conference on Data Mining Workshops, ICDM Workshops, Singapore, Singapore, November 17-20, 2018. IEEE. (link)
  30. Wei Wang, Han Yu, and Chunyan Miao. 2017. Deep model for dropout prediction in MOOCs. In Proceedings of the 2nd International Conference on Crowd Science and Engineering. ACM, New York, NY, USA, 26–32.
  31. Mengxiao Zhu, Yoav Bergner, Yan Zhan, Ryan Baker, Yuan Wang, and Luc Paquette. 2016. Longitudinal engagement, performance, and social connectivity: a MOOC case study using exponential random graph models. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge. ACM, New York, NY, USA, 223–230.

© Copyright 2020 AIIM-RG - All rights reserved.