ShodhKosh: Journal of Visual and Performing Arts. ISSN (Online): 2582-7472
Predicting Graduate Employability Using Ensemble Learning on Resume Data, Internships, and Soft Skill Scores
Dr. Maroti V. Kendre 1

1 Assistant Professor, School of Liberal Arts, Pimpri Chinchwad University, Maval (PMRDA), Dist. Pune, Maharashtra, India
2 Department of Electronics and Telecommunication Engineering, Pimpri Chinchwad College of Engineering, Pune, Maharashtra, India
3 Assistant Professor, Walchand Institute of Technology, Solapur, Maharashtra, India
4 Assistant Professor, Department of MBA, Modern Institute of Business Studies, Nigdi, Savitribai Phule Pune University, Pune, Maharashtra, India
5 Assistant Professor, CSMSS Chhatrapati Shahu College of Engineering, Chhatrapati Sambhajinagar, Maharashtra, India
6 Training and Placement Officer, Dr. D. Y. Patil Technical Campus, Varale-Talegaon, Pune, Maharashtra, India
1. INTRODUCTION

In today's knowledge economy, graduate employability has become a key measure of how well higher education institutions and academic programmes perform. The link between a degree and professional work is no longer direct; the transition now depends on a complex web of technical skills, real-world experience, and interpersonal ability. As global job markets grow more competitive and volatile, institutions must adapt by embedding employability-oriented models in their teaching practice. At the same time, demand is growing for structured, data-driven ways to assess and predict graduate employability, which has driven the application of advanced machine learning techniques to educational and behavioural data.

Traditionally, employability assessment relied on subjective judgement or coarse aggregate measures that rarely captured the multidimensional nature of job readiness. Digital learning platforms, placement tracking systems, and behavioural evaluation tools, by contrast, now make it possible to collect rich, structured data about a student's academic journey, experiential learning, and personal skills. Resume data, often the single most important record in a job search, spans everything from academic credentials to project participation and skill references; natural language processing (NLP) methods can convert such text into semantically meaningful vectors suitable for computational analysis. Internships likewise provide strong evidence of career readiness, demonstrating adaptability, domain experience, and the ability to work with others. Employers, for their part, increasingly recognise that soft skills such as communication, teamwork, flexibility, and emotional intelligence drive job performance and often matter more than technical skills for long-term career success.
Integrating these heterogeneous data sources is both an opportunity and a challenge for predictive modelling. Substantial progress has been made with machine learning, especially ensemble learning, in uncovering hierarchical patterns and complex relationships across data types. Ensemble models such as Random Forests, Gradient Boosting Machines (GBM), and Extreme Gradient Boosting (XGBoost) combine the predictive strengths of several base learners, improving generalisation and reducing overfitting. These methods perform especially well on high-dimensional data affected by noise and multicollinearity. For employability prediction, ensemble models can be configured to learn the complex interactions among cognitive indicators, experiential factors, and psychological traits, making the results both more reliable and more interpretable. Stacking, an advanced ensemble strategy that combines the predictions of multiple base models through a meta-learner, can further improve accuracy: by exploiting the complementary strengths of diverse models, it detects subtle patterns in the data, which matters particularly when the feature space mixes structured inputs (numerical scores) with unstructured ones (free-text job descriptions). Ensemble methods also support feature-importance analysis, helping stakeholders identify the key drivers of placement outcomes; such insights can inform curriculum design, career guidance programmes, and policies that build the job-related skills employers want. Yet despite the growing popularity of predictive analytics for educational outcomes, few studies have examined employability through a lens that jointly covers academic, social, and behavioural factors.
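The stacking strategy described above can be sketched with scikit-learn's StackingClassifier. This is a minimal illustration, not the study's actual pipeline: the synthetic data stands in for the combined resume, internship, and soft-skill features, and the choice of base learners and meta-learner is an assumption.

```python
# Sketch of a stacking ensemble, assuming scikit-learn; the synthetic
# feature matrix stands in for the paper's combined employability features.
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for resume/internship/soft-skill features and
# placed/not-placed labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Base learners capture complementary patterns; a logistic-regression
# meta-learner combines their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gbm", GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print(f"Hold-out accuracy: {stack.score(X_test, y_test):.3f}")
```

The `cv=5` argument ensures the meta-learner trains on out-of-fold base-model predictions, which is what prevents it from simply memorising the base learners' training-set fit.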
This gap underlines the need for comprehensive models that not only predict employment outcomes but also help institutional stakeholders understand them. Applying ensemble learning to a single dataset that combines resume features, internship records, and soft skill assessments yields a more complete picture of a graduate's employability. The proposed study therefore uses an ensemble-based machine learning system to fuse these heterogeneous sources and predict graduate employability accurately. The approach emphasises both predictive accuracy and interpretability, serving practical deployment as well as academic understanding, and its thorough model training, cross-validation, and feature analysis position it to contribute meaningfully to educational data mining and labour analytics. Ultimately, the study aims to help students, educators, and policymakers make data-driven decisions that create employment-focused educational environments.

2. Related Work

Predicting graduate employability has been a major topic in both educational data mining and human resource analytics. Existing work examines different facets of the problem, but several methodological gaps remain. Sharma et al. analysed employability using academic and work experience records; with models such as Decision Trees and SVM, they showed that academic success and institutional reputation have a modest effect on an individual's employability. However, they considered only structured academic data, omitting soft skills and experiential traits, which limited real-world applicability. Kumar and Bansal introduced natural language processing (NLP) for resume screening using TF-IDF and Word2Vec embeddings.
They showed that resume semantics substantially improve candidate–job matching. The study used raw text data in a novel way but included neither ensemble methods nor interpretability models, both of which matter for accuracy and trustworthiness. Patel et al. examined social and academic factors including gender, CGPA, and communication skills; using simple models such as Naïve Bayes and Random Forest, they found that emotional and academic traits played a large role, but their model omitted internship data and resume semantics, which are increasingly important in modern hiring. Desai et al. investigated skill-based clustering with unsupervised methods such as K-Means and PCA, showing that skill clusters can help align training programmes with job requirements; however, the study did not link these clusters to actual placement outcomes, limiting its value from a predictive-analytics standpoint. Arora and Gupta applied deep learning models such as CNN and LSTM and found they outperformed standard methods on large datasets, yet their black-box nature made them hard to interpret, and soft skills were not measured.

Table 1
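The TF-IDF resume representation discussed above can be sketched with scikit-learn's TfidfVectorizer. The sample resume snippets below are invented for demonstration; they are not drawn from any real dataset.

```python
# Illustrative sketch of turning resume text into TF-IDF vectors with
# scikit-learn; the sample resumes are invented for demonstration only.
from sklearn.feature_extraction.text import TfidfVectorizer

resumes = [
    "B.Tech in computer science, Python internship, machine learning projects",
    "MBA marketing, leadership and communication, sales internship",
    "Electronics degree, embedded systems projects, teamwork and adaptability",
]

# Lowercase the text and drop common English stop words before weighting
# each remaining term by its frequency scaled against corpus rarity.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_text = vectorizer.fit_transform(resumes)  # sparse matrix: resumes x vocabulary

print(X_text.shape)                         # (3, vocabulary size)
print(sorted(vectorizer.vocabulary_)[:5])   # first few learned terms
```

The resulting sparse matrix can be concatenated with the structured academic and soft-skill columns to form the kind of combined feature space the study describes.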
In general, earlier research has examined single facets, either structured academic records or unstructured resume text, without assembling a comprehensive dataset. The omission of behavioural and experiential data, together with gaps in model explainability, limits how broadly the findings generalise. This study therefore adopts an ensemble-based, multi-source approach that combines internship records, resume embeddings, and quantified soft skills while preserving interpretability through SHAP values.

3. System Architecture

3.1. Data Acquisition and Preprocessing

The first step acquires the "Job Placement Dataset" from Kaggle and cleans it. The dataset covers demographics, education, and employment status, and the raw data mixes categorical and numerical variables, so a structured preprocessing workflow is needed. Categorical variables such as "gender", "specialisation", and "work experience" are one-hot encoded: a categorical variable C_i ∈ {c_1, c_2, ..., c_k} becomes a binary vector v ∈ {0, 1}^k in which exactly one entry is 1 and the rest are 0. Numerical attributes such as SSC percentage (x_1), HSC percentage (x_2), degree percentage (x_3), and MBA percentage (x_4) are normalized using min-max normalization:

x' = (x − x_min) / (x_max − x_min)

This maps every numerical value into the range [0, 1], which helps gradient-based models converge.
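The preprocessing steps above can be sketched with pandas and scikit-learn. The column names and values below mirror the fields named in the text but are illustrative stand-ins, not rows from the actual Kaggle dataset.

```python
# Sketch of one-hot encoding plus min-max normalization, assuming
# pandas/scikit-learn; the toy rows are illustrative stand-ins.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "gender":          ["M", "F", "M"],
    "work_experience": ["Yes", "No", "No"],
    "ssc_percentage":  [67.0, 79.3, 65.0],
    "mba_percentage":  [58.8, 66.3, 57.8],
})

# One-hot encode categorical columns: each category c_j becomes its own
# 0/1 column, so exactly one indicator is 1 per row per variable.
encoded = pd.get_dummies(df, columns=["gender", "work_experience"])

# Min-max normalize numeric columns into [0, 1]:
# x' = (x - x_min) / (x_max - x_min)
num_cols = ["ssc_percentage", "mba_percentage"]
encoded[num_cols] = MinMaxScaler().fit_transform(encoded[num_cols])
print(encoded)
```

Fitting the scaler only on training data (and reusing it for the test split) is the usual way to avoid leaking test-set statistics into the normalization.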
Figure 1
Table 2 Comparative Analysis

| Model          | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC-ROC (%) |
|----------------|--------------|---------------|------------|--------------|-------------|
| Random Forest  | 88.21        | 87.34         | 86.45      | 86.89        | 90.42       |
| AdaBoost       | 86.14        | 84.67         | 85.02      | 84.84        | 88.91       |
| Gradient Boost | 89.37        | 88.12         | 87.76      | 87.94        | 91.15       |
| XGBoost        | 91.25        | 90.38         | 89.77      | 90.07        | 93.24       |
XGBoost consistently outperforms the alternatives, drawing on a varied feature set (resume embeddings, soft skill measures, and internship quantifiers) to capture the complex factors that shape a graduate's employability. Both the statistical results and practical experience therefore support the model's strong predictive performance.
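The five metrics reported in Table 2 can be computed with scikit-learn as sketched below. GradientBoostingClassifier on synthetic data is used as a self-contained stand-in for the paper's XGBoost model and real dataset, so the printed numbers will not match Table 2.

```python
# Sketch of computing the five reported metrics with scikit-learn;
# GradientBoostingClassifier on synthetic data stands in for XGBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_prob = model.predict_proba(X_te)[:, 1]  # scores for the positive class

print(f"Accuracy : {accuracy_score(y_te, y_pred):.4f}")
print(f"Precision: {precision_score(y_te, y_pred):.4f}")
print(f"Recall   : {recall_score(y_te, y_pred):.4f}")
print(f"F1-score : {f1_score(y_te, y_pred):.4f}")
print(f"AUC-ROC  : {roc_auc_score(y_te, y_prob):.4f}")
```

Note that AUC-ROC is computed from the predicted probabilities rather than the hard labels, since it measures ranking quality across all decision thresholds.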
Figure 3

Figure 3 Comparison of Accuracy Across Models

Figure 3 illustrates the accuracy of each model in predicting graduate employability. XGBoost achieves the highest accuracy at 91.25%, followed by Gradient Boost at 89.37%. Random Forest also performs well, while AdaBoost records the lowest accuracy at 86.14%. The contrast in bar heights clearly shows XGBoost's superior performance, confirming its effectiveness at handling complex feature interactions across the academic, experiential, and behavioural domains.
Figure 4

Figure 4 Comparison of Precision Across Models

Figure 4 shows how precisely each model identifies employable graduates, that is, what fraction of predicted positives are true positives. XGBoost achieves the highest precision at 90.38%, followed by Gradient Boost at 88.12%, while Random Forest (87.34%) and AdaBoost (84.67%) trail slightly.
Figure 5

Figure 5 Comparison of Recall Across Models

Figure 5 shows how completely each model retrieves all employable graduates. XGBoost achieves the highest recall at 89.77%, indicating strong sensitivity to true positives, with Gradient Boost next at 87.76%; AdaBoost and Random Forest show slightly lower recall. The magma-coloured bars accentuate the differences and make clear that models built on boosting tend to be more sensitive than those built on bagging.
Figure 6

Figure 6 Comparison of F1-Score Across Models

Figure 6 reports the F1-score, the harmonic mean of precision and recall. With 90.07%, XGBoost again leads, showing the best balance between the two. The steady rise from AdaBoost to XGBoost reflects the increasing effectiveness of the boosting variants, and the clear markers and uniform curves make performance gaps and trends across the ensemble models easy to identify.
Figure 7

Figure 7 Comparison of AUC-ROC Across Models

Figure 7 shows how well each model separates employable from non-employable graduates across decision thresholds. XGBoost achieves the highest AUC-ROC at 93.24%, indicating the strongest discriminative power, with Gradient Boost close behind; Random Forest and AdaBoost lag slightly. The smooth gradient of the curve reflects XGBoost's strong overall classification quality.
4. Conclusion

This study demonstrated that ensemble learning can accurately predict a graduate's employability by combining heterogeneous data sources: academic performance, internship records, soft skill scores, and resume semantics. Of the ensemble models examined, Extreme Gradient Boosting (XGBoost) performed best on every key evaluation measure, including accuracy, precision, recall, F1-score, and AUC-ROC. This consistency reflects how well gradient-boosted methods handle semi-structured, mixed-type data. Extensive feature engineering and dimensionality reduction transformed the dataset into a representation that supported more accurate pattern recognition. Adding TF-IDF-based resume embeddings and quantified soft skill measures proved especially valuable, as these features contributed most to predictive accuracy. Traditional academic indicators remain relevant but played a smaller role, suggesting that behavioural and experiential factors increasingly determine employment outcomes. The model's interpretability, achieved through SHAP analysis and gain-based feature importance scores, not only improves transparency but also yields actionable insight. These findings can help institutions refine curricula, run skill-building programmes, and target career counselling services so that graduates secure better jobs. The methods and results presented here show that machine learning serves multiple purposes, from workforce forecasting to institutional analytics. Future work may incorporate real-time labour-market trends or longitudinal student-outcome data to improve predictive power and responsiveness to changing job markets.
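The gain-based feature importance scores mentioned above can be sketched as follows, using scikit-learn's GradientBoostingClassifier as a self-contained stand-in for XGBoost; the feature names and synthetic data are illustrative, not the study's actual dataset or results.

```python
# Sketch of gain-based feature importance, one of the interpretability
# views described above; feature names are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["ssc_pct", "hsc_pct", "degree_pct", "mba_pct",
                 "internship_count", "soft_skill_score"]
X, y = make_classification(n_samples=400, n_features=len(feature_names),
                           random_state=1)

model = GradientBoostingClassifier(random_state=1).fit(X, y)

# feature_importances_ aggregates each feature's impurity reduction (gain)
# across all trees, normalised to sum to 1.
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name:18s} {score:.3f}")
```

SHAP values, by contrast, attribute each individual prediction to its features, so the two views are complementary: gain gives a global ranking, SHAP a per-student explanation.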
CONFLICT OF INTERESTS
None.
ACKNOWLEDGMENTS
None.
REFERENCES
Albina, A., and Sumagaysay, L. (2020). Employability Tracer Study of Information Technology Education Graduates from a State University in the Philippines. Social Sciences and Humanities Open, 2, 100055. https://doi.org/10.1016/j.ssaho.2020.100055
Assegie, T. A., Salau, A. O., Chhabra, G., Kaushik, K., and Braide, S. L. (2024). Evaluation of Random Forest and Support Vector Machine Models in Educational Data Mining. In Proceedings of the 2nd International Conference on Advancement in Computation and Computer Technologies (InCACCT) (131–135). IEEE. https://doi.org/10.1109/InCACCT61598.2024.10551110
Aviso, K. B., Janairo, J. I. B., Lucas, R. I. G., Promentilla, M. A. B., Yu, D. E. C., and Tan, R. R. (2020). Predicting Higher Education Outcomes with Hyperbox Machine Learning: What Factors Influence Graduate Employability? Chemical Engineering Transactions, 81, 679–684.
Celine, S., Dominic, M. M., and Devi, M. S. (2020). Logistic Regression for Employability Prediction. International Journal of Innovative Technology and Exploring Engineering, 9(3), 2471–2478. https://doi.org/10.35940/ijitee.C8170.019320
Chopra, A., and Saini, M. L. (2023). Comparison Study of Different Neural Network Models for Assessing Employability Skills of IT Graduates. In Proceedings of the International Conference on Sustainable Communication Networks and Application (ICSCNA) (189–194). IEEE. https://doi.org/10.1109/ICSCNA58489.2023.10368605
Maaliw, R. R., Quing, K. A. C., Lagman, A. C., Ugalde, B. H., Ballera, M. A., and Ligayo, M. A. D. (2022). Employability Prediction of Engineering Graduates Using Ensemble Classification Modeling. In Proceedings of the IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC) (288–294). IEEE. https://doi.org/10.1109/CCWC54503.2022.9720783
Monteiro, S., Almeida, L., Gomes, C., and Sinval, J. (2020). Employability Profiles of Higher Education Graduates: A Person-Oriented Approach. Studies in Higher Education, 1–14. https://doi.org/10.1080/03075079.2020.1761785
Nordin, N. I., Sobri, N. M., Ismail, N. A., Mahmud, M., and Alias, N. A. (2022). Modelling Graduate Unemployment from Students’ Perspectives. Journal of Mathematical, Computational and Statistical Sciences, 8(2), 68–78. https://doi.org/10.24191/jmcs.v8i2.6986
Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., and Lozano, J. A. (2021). Machine Learning and Knowledge Discovery in Databases. Springer Nature. https://doi.org/10.1007/978-3-030-86486-6
Philippine Statistics Authority. (2021). Unemployment Rate in September 2021 is Estimated at 8.9 Percent.
Shahriyar, J., Ahmad, J. B., Zakaria, N. H., and Su, G. E. (2022). Enhancing Prediction of Employability of Students: Automated Machine Learning Approach. In Proceedings of the 2nd International Conference on Intelligent Cybernetics Technology and Applications (ICICyTA) (87–92). IEEE. https://doi.org/10.1109/ICICyTA57421.2022.10038231
Shuker, F. M., and Sadik, H. H. (2024). A Critical Review on Rural Youth Unemployment in Ethiopia. International Journal of Adolescence and Youth, 29(1), 1–17. https://doi.org/10.1080/02673843.2024.2322564
Tamene, E. H., Salau, A. O., Vats, S., Kaushik, K., Molla, T. L., and Tin, T. T. (2024). Predictive Analysis of Graduate Students’ Employability using Machine Learning Techniques. In Proceedings of the International Conference on Artificial Intelligence and Emerging Technology (Global AI Summit) (557–562). IEEE. https://doi.org/10.1109/GlobalAISummit62156.2024.10947923
This work is licensed under a: Creative Commons Attribution 4.0 International License
© ShodhKosh 2025. All Rights Reserved.