Deep Learning Based Analysis of Student Aptitude for Programming at College Freshman Level

Predicting Freshman students' aptitude for computing is critical for researchers seeking to understand the foundations of programming ability. The dataset used in this study was drawn from a questionnaire administered to Senior students at a high school in the city of Kanchipuram, Tamil Nadu, India; the questions related to the students' social and cultural backgrounds and their experience with computers. Several hypotheses were also generated. The dataset was analyzed using three machine learning algorithms, namely the Backpropagation Neural Network (BPN) and the Recurrent Neural Network (RNN) (in its Gated Recurrent Unit (GRU) variant), with K-Nearest Neighbor (KNN) used as the classifier. Various models were obtained to validate the underpinning set of hypothesis clusters. The results show that the BPN models achieved high accuracy in predicting Freshman students' aptitude for computer programming.


Introduction
Computer science programs around the world have students who struggle with programming courses such as Programming Principles, Object-Oriented Programming, and Data Structures. Most of the time, students fail later programming courses because of a weak foundation at the Freshman level, and this is related to their aptitude for programming. In order to help potential applicants understand the abstract and logical thinking competencies needed for computer science programming courses, and to help University admissions offices select suitable candidates for such courses, a study was carried out among final-year high school students. A model that predicts student aptitude for programming from several features was built. Datasets acquired through a questionnaire were analyzed using data analytics software, e.g., Python libraries such as PyTorch and TensorFlow. The model used machine learning to classify student aptitude for computer programming, which helps in identifying students who are at risk of failing; note that improving the passing rates in introductory courses also has a direct impact on retention. Unlike other studies, which often correlate a student's aptitude for programming with past academic performance, this study also takes into account the student's family background and individual interaction with technology; the authors believe these are foundational factors in a student's attitude toward programming. The rest of the paper is organized as follows: Section 2 describes the problem statement and related literature, while Section 3 details the hypotheses and the questionnaire design. Section 4 presents the analysis approach and the deep learning algorithms used, while Section 5 provides the results of the study.

Problem Description & Related Literature
The literature review has two components, namely: i) work on model development and the factors that affect aptitude for programming, and ii) the use of machine learning techniques for analyzing students' aptitude for programming. Identifying the factors that affect student aptitude for programming can also help us understand how students learn to program, which in turn can help in planning interventions at an early stage and reduce the risk of attrition from the program. The factors most often associated with aptitude for computer programming are abstract thinking, logical thinking, mathematical skills, and problem-solving skills. Some studies indicate that gender plays a role, and that in this modern age males dominate the world of computer science. Results from several studies show a statistically significant difference in programming performance between male and female students, with male students performing better than their female counterparts. A factor that researchers often omit is the student's background; instead, they assume that performance is driven by previous grades or by skills such as problem solving. Psychological and sociological factors play a large part in a student's aptitude for programming, and such factors as predictors can be helpful in understanding the process that students go through when learning (Longi, September 26, 2016). Other studies indicate that a student's behavior during lectures and labs, e.g., gestures, outbursts, and collaboration with other students, has a large impact on aptitude for programming (Ahadi & Lister, 2015). However, these studies lack statistical depth in the larger context of the underlying question. Only a small amount of literature exists on using machine learning techniques to analyze students' aptitude for programming.
A Master's thesis (Longi, September 26, 2016) details the use of a Bayesian network to model the relationship between factors that affect programming performance. Further detailed literature analysis, curtailed here due to space limitations, can be obtained from the authors.

Hypotheses & Questionnaire Design
In our earlier detailed research work [7], we developed a set of hierarchical hypotheses and a corresponding questionnaire for comprehending Freshman aptitude for computer programming, as shown in Appendix-A. They deal with psychological and sociological aspects of students' views on computing and programming. It is noted that these hypotheses and questionnaire are better suited to Third World countries and to rural students in First World countries, who in general have little exposure to computing and computers. The questionnaire was distributed to Senior students in a high school in the city of Kanchipuram, Tamil Nadu, India, and a large body of data was collected.

Dataset Description & Analysis Basics
It is noted that the questionnaire (vide Appendix-A) is in two parts, viz., Part-A and Part-B. Details from the questionnaire were converted into a dataset format, which was then cleaned of errors; missing values were filled using column means. The dataset consisted of 157 instances with 35 attributes. A Likert scale ranging from 0 to 7 was used to rate each of the questions, where 0 is none, 1-6 run from Strongly Disagree to Strongly Agree, and 7 is not applicable. Part-A of the dataset deals with students' biographical backgrounds, which are detailed in Figs. 1 & 2.
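The mean-imputation step described above can be sketched as follows; this is an illustrative reconstruction, not the authors' code, and the sample responses are invented stand-ins for the Likert answers:

```python
import numpy as np

# Hypothetical sketch of the cleaning step: missing Likert responses
# (encoded here as np.nan) are replaced with the mean of their column.
def impute_with_column_means(X):
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)     # per-question mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]       # fill each gap with its column mean
    return X

responses = [[3.0, np.nan, 5.0],
             [4.0, 2.0, np.nan],
             [5.0, 6.0, 1.0]]
cleaned = impute_with_column_means(responses)
```

Each missing answer is thus replaced by that question's average across all respondents, leaving the 157 x 35 matrix fully numeric.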

Hypotheses Clustering
Appendix-A provides the list of hypotheses (H0-H10) and the grouping of the underlying questions under them. Once grouped, the hypotheses were correlated in clusters to generate inferences. In this case, C1 represents the correlation of hypotheses H0, H1 and H2; C2 represents the correlation of H3, H5 and H6; C3 represents the correlation of hypotheses H8, H8.1 and H8.2; and C4 represents the correlation of hypotheses H7, H9 and H10. Fig.5 presents an overview of the workflow, where three different machine learning approaches were selected, namely BPN and RNN (GRU), along with the K-Nearest Neighbor algorithm for classification; a brief overview of these techniques is presented in Appendix-B. Further, while data analytics open-source libraries such as PyTorch, WEKA, TensorFlow and RapidMiner are available, TensorFlow was chosen because: i) it provides excellent functionality and services compared to other popular deep learning frameworks, ii) it has low-level libraries which provide more flexibility, and iii) it offers a highly interactive development environment that allows for design-as-you-go flexibility. An Agile process model was used to develop the machine learning based models, as shown in Fig.4.

Analysis of Results
The models were evaluated using the performance metrics of accuracy, precision, recall, and F1-score, which are defined as follows:
• Accuracy: characterizes the degree to which a predicted value agrees with an actual value (Devasia & Vinushree, 2016).
• Precision: the fraction of positive predictions that are correct. High precision values indicate that the probability of the test set being accurately classified is high. In the context of this paper, precision indicates the number of students having aptitude for programming.
• Recall: the fraction of actual positives that the models correctly predict.
• F1-score: the harmonic mean of precision and recall, used here to identify the best-performing algorithm.
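The four metrics above can be computed directly from confusion-matrix counts; the sketch below uses illustrative counts (tp, fp, fn, tn are not values from the paper):

```python
# Minimal sketch of the four evaluation metrics from confusion-matrix
# counts: true/false positives (tp, fp) and false/true negatives (fn, tn).
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)     # of predicted positives, how many are right
    recall = tp / (tp + fn)        # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only, not results from this study.
acc, prec, rec, f1 = metrics(tp=25, fp=3, fn=2, tn=2)
```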

Model Implementation Summary

Backpropagation Neural Network (BPN)
With the Backpropagation Neural Network (BPN), the datasets were randomly divided into training and test data in the ratio of 80:20; note that Ward, Peters, & Shelley (2010) state that if the training dataset is too small or too large, the performance of the models will be affected. The output variables are the means of the corresponding correlation regressions. The ADAM optimizer [10] was used for each model instead of the classical stochastic gradient descent procedure to update network weights; ADAM maintains an individual learning rate for every parameter in the network.
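The per-parameter update that distinguishes ADAM from plain gradient descent can be sketched as below; this is the standard update rule from Kingma & Ba applied to a toy quadratic loss, not the paper's training code, and the step count and learning rate are illustrative:

```python
import numpy as np

# Sketch of ADAM on a toy loss f(w) = ||w||^2 (gradient 2w). Each parameter
# keeps its own first and second moment estimates, so every parameter gets
# its own effective step size. Hyperparameters are the usual defaults.
def adam_minimize(grad, w, steps=200, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # biased first-moment estimate
        v = b2 * v + (1 - b2) * g * g      # biased second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w = adam_minimize(lambda w: 2 * w, np.array([3.0, -2.0]))
```

On this convex toy problem the iterates approach the minimum at the origin.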
Correlation C1 relates to hypotheses H0, H1 and H2. In this case the model was built with one input layer, two hidden layers and one output layer, taking an input dimension of 7, representing the 7 features (or questions) derived from the three hypotheses. The input layer contains 14 nodes with uniform initial weights and the Sigmoid activation function [11]. The two hidden layers contain 8 and 4 nodes respectively, both with uniform initial weights and Sigmoid activation, while the output layer has one node.
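A forward pass through the C1 topology just described (7 inputs, layers of 14, 8, 4 and 1 sigmoid units) can be sketched in NumPy as follows; the random weights stand in for the uniformly initialized weights mentioned above, and the input vector is a hypothetical set of Likert answers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from the C1 model described in the text: 7 -> 14 -> 8 -> 4 -> 1.
rng = np.random.default_rng(0)
sizes = [7, 14, 8, 4, 1]
weights = [rng.uniform(-0.5, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)    # affine transform followed by sigmoid
    return a

y = forward(np.ones(7))           # one student's 7 encoded answers (illustrative)
```

The single sigmoid output lies in (0, 1) and is thresholded to give the binary aptitude prediction.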
The first test of the model, with the number of epochs set to 100, yielded an accuracy of 37.5%, as shown in Fig.6. With the number of epochs increased from 100 to 400 and the loss function changed from binary cross-entropy to MSE (Mean Square Error), the prediction accuracy became 94%; the use of MSE proved better than binary cross-entropy (Fig.7). However, the loss graph still did not converge, meaning that the model had not reached the stopping criterion. With 900 epochs, the model yielded a prediction accuracy of 93.75%. Based on the three tests, the loss graph starts converging at approximately 600 epochs. The stopping criterion for this model would therefore be around 600 epochs, as anything beyond that gives an accuracy of 93.75%, as is evident from Figure 8; the accuracy remains constant from 500 epochs onward. The corresponding confusion matrix is shown in Figure 9. Correlation 2 relates to hypotheses H3, H5 and H6 and has seven input parameters; the corresponding model was built using one input layer, four hidden layers and one output layer. With 400 epochs, Sigmoid activation and MSE as the loss function, the model gave a prediction accuracy of 79.41%.
With an increase in the number of epochs from 400 to 700 (Figs. 10 & 11), the model yielded an accuracy of 87.5%. With 1200 epochs, the prediction accuracy is 90.625%; the corresponding confusion matrix is shown in Figure 12.
Correlation 3 relates to hypotheses H8.0, H8.1 and H8.2 and has seven input parameters. The model gave a prediction accuracy of 96.875% in the test with 800 epochs (Figs. 13 & 14) and was declared to yield high prediction accuracy. The confusion matrix is shown in Fig.15.

Gated Recurrent Unit (GRU-RNN)
The datasets were randomly divided into training and test data in the ratio of 80:20. The output variables were the means of the corresponding correlation regressions, and the ADAM optimizer was used for each model. All the models were built with activation='Sigmoid', loss='MSE' and a batch size of 15. On Correlation 3, the model yields a prediction accuracy of 96.875% with the number of epochs set to 500; this model is declared as having the highest prediction accuracy (Fig.24), and the corresponding confusion matrix is shown in Figure 25. On Correlation 4, the model shown in Fig.26 yields a prediction accuracy of 96.875% at 400 epochs, which was the highest accuracy; the confusion matrix is shown in Fig.27.
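The gated update performed at each step of a GRU can be sketched as a single NumPy cell; the dimensions and random weights below are illustrative only and do not reproduce the paper's trained models:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step: gates decide how much of the old state to keep and how much
# of the candidate state to take. Weight shapes are illustrative.
def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_tilde           # interpolate old and new

rng = np.random.default_rng(1)
d_in, d_h = 7, 4                               # hypothetical dimensions
Wz, Wr, Wh = (rng.standard_normal((d_in, d_h)) * 0.1 for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((d_h, d_h)) * 0.1 for _ in range(3))

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):       # a length-5 input sequence
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
```

Because each new state is a convex combination of the previous state and a tanh candidate, the hidden values stay bounded in (-1, 1).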

K-Nearest Neighbor (KNN)
The machine learning algorithms were applied using the hypothesis variables as input to predict aptitude for programming. The KNN model was constructed using Python and the Keras library, and the training-to-test data ratio was set at 80:20. The number of neighbors for each model was set to 5; note that a rule of thumb says K = sqrt(N)/2, where N is the number of samples in the training set.
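The KNN prediction step can be sketched in pure Python as a majority vote among the k nearest training points; the toy data and the "low"/"high" labels below are illustrative, not drawn from the study's dataset:

```python
from collections import Counter

# Minimal KNN sketch: rank training points by squared Euclidean distance
# to the query and take a majority vote among the k nearest labels.
def knn_predict(train_X, train_y, x, k=5):
    nearest = sorted(range(len(train_X)),
                     key=lambda i: sum((a - b) ** 2
                                       for a, b in zip(train_X[i], x)))
    votes = Counter(train_y[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

# Two illustrative clusters of students in a 2-feature space.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["low", "low", "low", "high", "high", "high"]
label = knn_predict(train_X, train_y, (2, 2), k=3)
```

Note that KNN does no training beyond storing the data, which is why its training time was the shortest of the three approaches.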
For Correlation 1, the model yields its best prediction accuracy of 84.375% at 5 nearest neighbors; the corresponding confusion matrix is shown in Fig.28.

Summary of Analysis
The summary of the analysis is captured in Table-1, wherein the respective correlations are listed against their F1-scores.

Conclusion
The objective of this study was to build a model that predicts Freshman students' aptitude for computer programming using machine learning algorithms. Several hypotheses were conjectured and a corresponding questionnaire generated; these were given to final-year high school students in India, and the dataset was collected. Four models were built for each ANN, based on the four correlations generated from the clustered hypothesis sets; KNN was used as a classifier. The performance of the models was computed using the test dataset, which was 20% of the original dataset. The results show that the BPN models achieved high accuracies in predicting Freshman students' aptitude for computer programming. The best correlation scores for the clustered hypotheses C1, C2, C3 and C4 were 94%, 91%, 97% and 91% respectively. Although the best model was the BPN, it took the second-longest time to train, the longest being the RNN and the shortest the KNN. The results show that the models can be employed to predict Freshman students' aptitude for programming. Further work in this arena includes the use of Convolutional Neural Networks (CNN) to study students' aptitude for programming and the generation of several useful 3-D metrics.

Brief Overview of the Algorithms Used

Multiple Linear Regression
Multiple Linear Regression is the process of using many independent variables to determine one dependent variable (a many-to-one relationship). In Multiple Linear Regression, we try to find the relationship between two or more independent variables (inputs) and the corresponding dependent variable (output). The independent variables can be continuous or categorical.
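The many-to-one fit described above can be sketched as an ordinary least-squares solve; the synthetic data and coefficients below are illustrative:

```python
import numpy as np

# Illustrative sketch: fit y = b0 + b1*x1 + b2*x2 by least squares.
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 2))               # two independent variables
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]        # exact linear relationship

A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # [b0, b1, b2]
```

With noise-free data the solver recovers the generating coefficients; with real questionnaire data the residuals would of course be nonzero.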

Artificial Neural Network (ANN)
An Artificial Neural Network (ANN) can be defined as an information processing tool which mimics the learning methodology of biological neural networks. It derives its origin from the human nervous system, which consists of a massively parallel interconnection of a large number of neurons that perform different perceptual and recognition tasks in a short amount of time. The last part of the research focused on using ANNs to build a model that predicts student aptitude for programming based on the hypotheses elucidated earlier. For this, four different models, based on the correlation of each hypothesis cluster, were designed. The selected artificial neural networks are:

Backpropagation Neural Network (BPN)
Backpropagation is a training algorithm for feed-forward neural networks which works by computing the gradient of the loss function with respect to each weight via the chain rule, one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms. Backpropagation is short for "backward propagation of errors." It is a standard method of training artificial neural networks and calculates the gradient of a loss function with respect to all the weights in the network.
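The chain-rule computation just described can be sketched on a one-hidden-layer network and verified against a finite-difference estimate; the shapes and random data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass of a tiny 3 -> 4 -> 1 sigmoid network with squared-error loss.
def loss(W1, W2, x, t):
    h = sigmoid(x @ W1)                    # hidden activations
    y = sigmoid(h @ W2)                    # network output
    return 0.5 * np.sum((y - t) ** 2), h, y

# Backward pass: chain rule applied layer by layer, last layer first.
def grads(W1, W2, x, t):
    _, h, y = loss(W1, W2, x, t)
    dy = (y - t) * y * (1 - y)             # error at the output layer
    dW2 = np.outer(h, dy)
    dh = (W2 @ dy) * h * (1 - h)           # error propagated back one layer
    dW1 = np.outer(x, dh)
    return dW1, dW2

rng = np.random.default_rng(3)
x, t = rng.standard_normal(3), np.array([1.0])
W1, W2 = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))
dW1, dW2 = grads(W1, W2, x, t)

# Finite-difference check on a single weight.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
fd = (loss(W1p, W2, x, t)[0] - loss(W1m, W2, x, t)[0]) / (2 * eps)
```

Agreement between the analytic gradient and the finite-difference estimate confirms the chain-rule derivation.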

Recurrent Neural Network (RNN) - Gated Recurrent Network
A Recurrent Neural Network (RNN) is a type of neural network in which the output from the previous step is fed as input to the current step. In traditional neural networks, all inputs and outputs are independent of each other, which is inadequate for sequential data; the RNN addresses this with a hidden layer. The main and most important feature of the RNN is its hidden state, which remembers information about a sequence. RNNs are especially powerful in use cases where context is critical to predicting an outcome, and they are distinct from other types of artificial neural networks because they use feedback loops to process a sequence of data that informs the final output, which can itself be a sequence. The feedback loops allow information to persist; the effect is often described as memory. RNNs built with LSTM units categorize data into short-term and long-term memory cells, enabling the network to decide which data is important, should be remembered and looped back into the network, and which data can be forgotten or left out.
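The recurrence described above, where each step's output feeds the next, can be sketched as a minimal vanilla RNN loop; dimensions and random weights are illustrative:

```python
import numpy as np

# Minimal vanilla-RNN recurrence: the hidden state h carries information
# from earlier steps forward through the feedback (hidden-to-hidden) weights.
rng = np.random.default_rng(4)
Wx = rng.standard_normal((7, 4)) * 0.1    # input-to-hidden weights
Wh = rng.standard_normal((4, 4)) * 0.1    # hidden-to-hidden (feedback) weights

h = np.zeros(4)                           # initial hidden state ("memory")
for x in rng.standard_normal((6, 7)):     # a sequence of six 7-d inputs
    h = np.tanh(x @ Wx + h @ Wh)          # previous state feeds the next step
```

After the loop, h summarizes the whole sequence; gated variants such as the GRU refine this same recurrence with learned gates.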

K-Nearest Neighbor
The K-Nearest Neighbor (KNN) algorithm uses the entire dataset as its training phase. Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for the k most similar instances, and the label of the most similar instances is returned as the prediction (Atul, 2020).