DEVELOPMENT OF A MACHINE LEARNING ALGORITHM TO PREDICT AUTHOR’S AGE FROM TEXT

Author's age prediction is the task of determining an author's age by studying the texts they have written. Predicting an author's age can shed light on the trends, opinions, and social and political views of an age group. Marketers can use this information to promote a product or service to an age group according to its expressed interests and opinions. Methods in natural language processing have made it possible to predict an author's age from text by examining variation in linguistic characteristics, and many machine learning algorithms have been applied to the task. However, in social networks, computational linguists face numerous issues, and machine learning techniques, being performance driven, face their own challenges in realistic scenarios. This work developed a model that predicts an author's age from text with a machine learning algorithm (Naïve Bayes) using three types of features, namely content-based, style-based and topic-based features. The trained model gave a prediction accuracy of 80%.


Introduction
The problem of identifying an author's age from text is important because it helps in fields such as forensics and marketing. Author profiling is used to create a profile of the author of a text. Such a profile includes the age, gender, native language and personality traits of the author. In author profiling, linguistic features are used to determine the profile of an author, and the most commonly used techniques are different kinds of machine learning techniques (Elias Lundeqvist et al., 2017). An author's age is strongly connected with the author's language, as numerous sociolinguistic theories hold. Depending on an individual's life stage, different linguistic approaches and choices are observed, resulting in age-related linguistic variation. A basic principle that differentiates the language of adults from "teens' language" is that adults use more standard forms than adolescents, who prefer non-standard forms and generally more unconventional language structures. Predicting an author's age from text is a serious sociolinguistic problem that requires technical attention, and a model must be built to provide a sound basis for predicting an author's age from text (Dong et al., 2011).
However, computational linguists face numerous issues in social networks. First, little information about the authors' gender, age, social class, race, geographical location, etc., is available to researchers (Herring, 2001). Indeed, most online social networks do not offer open access to users' profile data. Hence, it is difficult to collect labeled training data for this task. Second, communication in online social networks typically occurs via posts on guestbooks, blogs, walls, etc. These are usually very short messages, often containing non-standard language usage, which makes this type of text a challenging genre for both natural language processing and machine learning. Furthermore, given the speed at which chat language has emerged and continues to develop, especially among adolescents, another challenge in automatically detecting false profiles on social networks is the constant retraining of the machine learning algorithms needed to pick up new variations of chat language usage that are connected to age and/or gender (Herring, 2001).
Therefore, to predict the ages of these authors, text documents can be classified into a set of predefined classes using a machine learning technique. The classification is based on features extracted from the text documents, which are used to train the classifier. The classifier then assigns classes to new data based on the statistics learned from the labeled dataset. In this work, the Naive Bayes classifier was used as the machine learning technique.

Literature Review
Recently, machine learning approaches have been explored to estimate the age of an author using text written by that person. This has been modeled as a classification problem, in a similar spirit to sociolinguistic work where age has been investigated in terms of differences in the distributions of characteristics between cohorts (Dong Nguyen et al., 2011). In machine learning research, these cohorts have usually been determined for practical reasons relating to the distribution of age groups within a corpus, although the boundaries have sometimes also made sense from a life stage perspective. For example, researchers have modeled age as a two-class classification problem with boundaries at age 40 (Garera and Yarowsky, 2009). [...] negative words, more future-tense and fewer past-tense verbs, and fewer self-references. The results also showed a general pattern of increasing cognitive complexity. Barbieri (2008) used keyword analysis to investigate language and age. Two groups (15-25 and 35-60) were compared. The results showed that younger speakers' speech is characterized by slang and swear words and by indicators of speaker stance and emotional involvement, while older people tend to use more modals. Morgan et al. (2017) examined the separate and joint predictive validity of linguistic and metadata features in predicting the age of Twitter users. The work created a labeled dataset of Twitter users across three age groups (youth, young adults, adults) by collecting publicly available birthday announcement tweets using the Twitter Search application programming interface, and applied logistic regression.

Methodology
For text classification, a supervised machine learning method, Naive Bayes classification, was used. The Naive Bayes classifier is based on Bayes' theorem for calculating the posterior probability:
Posterior probability = (Prior probability x Likelihood) / Evidence
or, for a document d and a class c:
P(c|d) = P(d|c) P(c) / P(d)
A Naive Bayes classifier makes the assumption that all attributes (input features) are independent of each other given the class.
To generate the set of input features from text, the Naive Bayes classification uses a term-document matrix (bag-of-words model). The model was trained using the content-based, style-based and topic-based features.
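As an illustration of the method described above, the following is a minimal sketch of a bag-of-words Naive Bayes classifier. It is not the Weka implementation used in this work. The whitespace tokenizer, the Laplace (add-one) smoothing, the function names and the toy training texts are all assumptions made for this example. Since P(d) is the same for every class, the sketch compares only P(d|c) P(c), computed in log space.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(documents, labels):
    """Estimate log priors and Laplace-smoothed log likelihoods from labeled texts."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> bag-of-words counts
    for text, label in zip(documents, labels):
        word_counts[label].update(tokenize(text))

    vocabulary = {word for counts in word_counts.values() for word in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_doc_counts.items()}

    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values())
        # Add-one smoothing (an assumption here) avoids zero probabilities.
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocabulary)))
            for w in vocabulary
        }
    return log_prior, log_likelihood, vocabulary


def predict(text, log_prior, log_likelihood, vocabulary):
    """Return the class with the highest (unnormalized) log posterior."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for word in tokenize(text):
            if word in vocabulary:  # words unseen in training are ignored
                score += log_likelihood[c][word]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy, made-up training texts; the labels only borrow two age-group names for illustration.
train_texts = [
    "lol that exam was so hard omg",
    "omg lol cant wait for the party",
    "the committee should review the education policy",
    "marital responsibility requires mutual respect and policy",
]
train_labels = ["15-20", "15-20", "30-34", "30-34"]

model = train_naive_bayes(train_texts, train_labels)
predicted = predict("omg the party was lol", *model)
```

Summing log probabilities instead of multiplying raw probabilities is a common choice that avoids numerical underflow on longer texts.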

System Design and Implementation
The rules for author's age prediction from text
The decision rule of a naive Bayes classifier for author's age prediction from text, based on the posterior probabilities, can be stated as: assign a text x to age class a if P(ω = a | x) ≥ P(ω = b | x) for every other age class b ............ (1)
For the evaluation of classification algorithms on the task of author's age prediction, a standard approach was adopted, i.e. data preprocessing, feature extraction and model/classifier training. Specifically, each author's text (post) from an age class is first preprocessed. During preprocessing, each post is broken into sentences and each sentence is split into words. Subsequently, the feature extraction approaches are applied in parallel and independently to each post. The text mining, linguistic-based and context-based features are extracted, constructing the vectors VTM, VSL and VCT, respectively. These vectors are then concatenated into a super vector V = VTM || VSL || VCT. This results in one feature vector, V, per author's text, which is handled by a classification algorithm in order to label each post with an age class.
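The preprocessing and feature concatenation steps described above can be sketched as follows. This is a hedged illustration only: the term lists CONTENT_TERMS and CONTEXT_TERMS, the three extractor functions and the particular measures they compute are hypothetical stand-ins, since the exact features behind VTM, VSL and VCT are not listed here.

```python
import re


def preprocess(post):
    """Break a post into sentences, then split each sentence into lowercase words."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", post) if s.strip()]
    return [re.findall(r"[a-z']+", s.lower()) for s in sentences]


# Hypothetical term lists, chosen only for this illustration.
CONTENT_TERMS = ["education", "students", "marriage"]
CONTEXT_TERMS = ["school", "work", "family"]


def text_mining_features(sentences):
    """V_TM stand-in: counts of selected content terms."""
    words = [w for s in sentences for w in s]
    return [words.count(t) for t in CONTENT_TERMS]


def linguistic_features(sentences):
    """V_SL stand-in: average sentence length (in words) and average word length."""
    words = [w for s in sentences for w in s]
    if not words:
        return [0.0, 0.0]
    return [len(words) / len(sentences), sum(map(len, words)) / len(words)]


def context_features(sentences):
    """V_CT stand-in: presence (1) or absence (0) of selected context terms."""
    words = {w for s in sentences for w in s}
    return [1 if t in words else 0 for t in CONTEXT_TERMS]


def feature_vector(post):
    """Build V = V_TM || V_SL || V_CT by concatenating the three vectors."""
    sentences = preprocess(post)
    return (text_mining_features(sentences)
            + linguistic_features(sentences)
            + context_features(sentences))


v = feature_vector("Students skip school. Education needs hard work!")
```

Each extractor sees only the preprocessed post and not the output of the others, which mirrors the "in parallel and independently" description; the classifier then receives the single concatenated vector.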

Data Set Description
The analysis was based on responses to given debate topics. The following topics were given: "Declining education standards is caused by laziness on the part of students"; "Men always shy away from their marital responsibility"; and "A good character is better than beauty".
These topics were sent to the students' e-mail addresses for responses, and the responses were classified into three age groups: 15-20, 24-30 and 30-34. The data were collated according to age group and used to train the model for predicting an author's age from text.

System Implementation
In the implementation of author's age prediction from text, 10-fold cross-validation was set up to build and test the model: in each fold, 90% of the sample data was used for training and the remaining 10% for testing. The total number of samples used was one hundred (100). The Weka plugin and the Java programming language were used. The NetBeans IDE was used to integrate the work into the user interface module, as shown in Figures 5.1 to 5.4.
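To make the 10-fold cross-validation setup concrete, the sketch below splits 100 samples into ten folds and averages the accuracy over them. It is an illustration under stated assumptions, not the Weka procedure used in this work: the helper names k_fold_indices and cross_validate, the fixed shuffle seed, and the placeholder majority-class model with its toy labels are all hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(samples, labels, train_fn, predict_fn, k=10):
    """Return the mean accuracy over k folds for any train/predict function pair."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# With 100 samples, every fold trains on 90 samples and tests on 10.
fold_sizes = [(len(train), len(test)) for train, test in k_fold_indices(100)]


# Placeholder majority-class model, used only to show how cross_validate is called.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)


def predict_majority(model, x):
    return model


toy_labels = ["15-20"] * 60 + ["30-34"] * 40  # made-up labels, not the study's data
toy_accuracy = cross_validate(list(range(100)), toy_labels,
                              train_majority, predict_majority)
```

Any classifier can be plugged in through train_fn and predict_fn, so the same loop could evaluate a Naive Bayes model in place of the placeholder.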

Results and Discussions
Correctly

Conclusion and Future Work
A good system for author's age prediction is required in various domains, ranging from the analysis of sensitive text for national security to commercially important data from comments and product reviews. This work modeled the author's age using the writing style and content of the text. The best results were achieved when the context information was used along with the content and style of the texts, with a machine learning algorithm, Naïve Bayes, used for the prediction. Future work could introduce sentiment analysis to discover more differences in text written by authors from different classes. This may yield better accuracy in identifying the author's profile.