Customer gender prediction system on hierarchical E-commerce data

E-commerce services provide online shopping sites and mobile applications for small and medium sellers. To provide more efficient buying and selling experiences, a machine learning system can be applied to predict the organization and display of products that maximizes the chance of presenting useful information to users and thereby facilitates online purchases. It is therefore important to understand which products are relevant to each gender. In this work, we present a statistical machine learning (ML)-based gender prediction system that predicts the gender “male” or “female” from transactional E-commerce data. We introduce a set of feature augmentation approaches, including unique IDs decomposition, context window-based history generation, and extraction of identical hierarchies from the training set, to address gender prediction as a classification task on online transactional data. The experimental results show that different feature augmentation approaches, as well as different term or feature weighting approaches, can significantly enhance the performance of the statistical ML-based gender prediction system. This work thus presents an ML-based implementation approach to E-commerce gender prediction, in which session augmentation combined with a support vector machines (SVMs) classifier significantly improves performance.


Background
This work presents research on implementing a machine learning (ML)-based automatic text classification (ATC) [1-4] system for gender prediction on transactional E-commerce data. E-commerce services provide online shopping sites and mobile applications for small and medium sellers. To provide more efficient buying and selling experiences, a machine learning system can be applied to predict the organization and display of products that maximizes the chance of presenting useful information to users and thereby facilitates online purchases. In this work, we address this problem by predicting the gender "male" or "female" from browsing history using statistical machine learning techniques. A set of different approaches is considered to enhance the gender prediction system. We introduce feature augmentation approaches, including unique IDs decomposition, context window-based history generation, and extraction of identical hierarchies from the training set, to address gender prediction as a classification task on online transactional data. The experimental results show that different feature augmentation approaches, as well as different term or feature weighting approaches, can significantly enhance the performance of the statistical ML-based gender prediction system.
In E-commerce, business-to-business-to-customer (B2B2C) platforms run several services that provide online shopping sites and mobile applications for small and medium sellers. Transaction data, such as product browsing and purchasing activities from buyers and product portfolios from sellers, can be aggregated to provide more efficient buying and selling experiences. For example, statistical machine learning (SML) techniques can be applied to predict the organization and display of products that maximizes the chance of presenting useful information to users and thereby facilitates online purchases.
The primary motivation for exploiting session augmentation-based approaches can be attributed to two main goals. The first is to create more effective features from product viewing transaction sessions by augmenting the features of the sessions S = {s_1, s_2, ..., s_j}, and thereby enhance system performance. The second is to create simple term or feature weighting approaches that yield a more effective classifier and further boost system performance. The term or feature weight is the degree of importance of term or feature t_i in each transaction session s_j. Term weighting plays a very significant role in automatic text classification (TC). An effective term or feature weighting approach can therefore generate more information-rich terms and assign appropriate weighting values to them.
Most SML-based approaches rely on either document-indexing or class-indexing, which operate in the document space [2] or the category space [3,4], respectively. In this dataset, the transactional (log) sessions belong to neither a document space nor a class space. Therefore, to handle such data in a statistical classification system, we use a binary weighting approach, a cosine-normalized binary approach that normalizes the features of each session, and finally a normalized term frequency-based approach.
In this work, each session in the transaction data is modeled as a feature vector extracted from the session's product IDs. Gender prediction is a binary classification problem in which a session is labeled as either "male" or "female." The trainable classifier is expected to learn patterns by identifying the feature values that are most correlated with the classes "male" and "female." When a new session or log file is given to the system, the learned patterns are used to classify each session as either "male" or "female" and to assign it a score between 0 and 1. Our goal is to show that an effective gender prediction system can be built relying solely on session augmentation and simple weighting approaches. Session augmentation with unique IDs decomposition alone produces a good classification score. Adding the context window approach to unique IDs decomposition boosts performance further, and incorporating identical hierarchies on top of both boosts it further still. These approaches are therefore effective for improving the SML-based gender prediction system.
Recently, many experiments have been conducted using different term weighting approaches [1, 3-7] to address the classification task as a statistical method. Document-indexing-based and four fundamental information-element-based [3,4] weighting approaches are considered the most popular term weighting methods in ATC.
Many experiments have also been conducted using document- and class-indexing-based term weighting approaches to address the classification task as a statistical method [2, 7-11]. TF.IDF is the most popular term weighting method for ATC and document-indexing [2]. Salton and Buckley [11] discussed many term weighting approaches in the information retrieval (IR) [12-16] field and found that normalized TF.IDF is the best document weighting function. TF.IDF is therefore considered one of the most standard weighting approaches in statistical machine learning, especially in text classification. In contrast, Sohrab [3,4] recently proposed class-indexing-based (TF.IDF.ICSδF) term weighting. In class-indexing, a subset of documents from the global document space D = {d_1, d_2, ..., d_n} is allocated to a certain class c_k (k = 1, 2, ..., m) according to their topics, in order to create a boundary line in the vector space during training. The class space is therefore defined as C = {c_1, c_2, ..., c_m}, where a set of documents with the same topic is assigned to a certain class c_k. The class-indexing-based term weighting approach outperformed all traditional weighting approaches.
In the last few years, researchers have attempted to improve the performance of TC by exploiting statistical classification approaches and machine learning techniques, including probabilistic Bayesian models [17], support vector machines (SVMs) [18,19], decision trees (Lewis and Ringuette) [17], Rocchio classifiers [20], and multivariate regression models [21]. Among them, SVM-based classifiers have achieved great success in classification problems. Therefore, we adopt SVMs as the classifier for learning and predicting gender.

Methods
The proposed ML-based gender prediction system follows three steps: (i) session augmentation to generate candidate features, (ii) session-to-vector generation, and (iii) classification.

Session augmentation approach
The session augmentation approach can be decomposed into three different representations: (i) session augmentation with unique IDs decomposition, (ii) session augmentation with context window, and (iii) session augmentation with identical hierarchy. The following sections describe how these augmentations are produced.

Session augmentation with unique IDs decomposition
In statistical classification methods [13, 15, 22-25], it is important to generate good features from text and then apply a term weighting approach to convert the text into a vector, which is the input to a classifier. A session, or single product viewing log, in the training dataset is composed of four columns and can be read as: u10001, ... The product IDs in the dataset are decomposed into two different compositions: uni-gram and bi-gram.

Bi-gram-based feature composition
The product IDs follow the hierarchy from the top category "A" to the leaf category "D" through the intermediate categories "B" and "C." Therefore, the bi-gram feature composition proceeds top-down to create bi-gram features. The features from the above single session are "A00001-B00001," "B00001-C00001," and "C00001-D00001," and the corresponding label-augmented features are "A00001-B00001-label," "B00001-C00001-label," and "C00001-D00001-label." With the uni-gram and bi-gram-based approaches, we can generate more features for the gender prediction system.
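The decomposition described above can be illustrated with a minimal Python sketch. The function name `decompose_product_id` and the optional `label` argument are hypothetical; the text does not say how label-augmented features are handled at test time, so the sketch only appends them when a label is supplied.

```python
def decompose_product_id(product_id, label=None):
    """Split one hierarchical product ID into uni-gram and bi-gram features."""
    # "A00001/B00001/C00001/D00001/" -> ["A00001", "B00001", "C00001", "D00001"]
    levels = [part for part in product_id.split("/") if part]
    unigrams = list(levels)
    # Pair each category with its direct child, top-down.
    bigrams = [f"{parent}-{child}" for parent, child in zip(levels, levels[1:])]
    features = {"unigrams": unigrams, "bigrams": bigrams, "label_augmented": []}
    if label is not None:
        # Assumption: label-augmented bi-grams simply append the label.
        features["label_augmented"] = [f"{bigram}-{label}" for bigram in bigrams]
    return features


example = decompose_product_id("A00001/B00001/C00001/D00001/", label="label")
print(example["bigrams"])
print(example["label_augmented"])
```

For the example ID, the bi-grams are the three parent-child pairs listed in the text, and the label-augmented variants carry the same pairs with the label appended.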

Session augmentation with context window
It is important to analyze how similar the current session is to its surrounding sessions. To capture this contextual information, we create a history based on a window size; the context window-based approach augments the current session with that history. A session space S = {s_1, s_2, s_3, ..., s_m} is the set of sessions in the training dataset. In the context window approach, we set a window size around a given session to generate a history from the surrounding sessions. For instance, if we set window size = 3, the history is created from the current position, its two previous sessions, and its two next sessions. Figure 1 shows an example of setting a context window among the sessions. In this figure, window size = 3 is set on session 3 to generate a history incorporating the two previous sessions and the two next sessions. In Fig. 1, the terms t_i (t_1, t_2, ..., t_n) represent the IDs of a single session. The history of a given session is built only if the session satisfies the required conditions. Algorithm 1 shows the session augmentation based on the context window: the current session's features are augmented with the features of the previous and next sessions according to the window size.
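A minimal sketch of context-window augmentation follows. It is not Algorithm 1 itself: the conditions under which a history is built are not reproduced in the text, so this version applies none and simply takes neighbours in file order. The function name `augment_with_context` is hypothetical, and the mapping "window size 3 means two sessions on each side" follows the example above.

```python
def augment_with_context(sessions, window_size=3):
    """Append the features of neighbouring sessions to each session.

    sessions: list of feature lists, in the order they appear in the log.
    """
    reach = window_size - 1  # window size 3 -> two sessions on each side
    augmented = []
    for i, current in enumerate(sessions):
        start = max(0, i - reach)
        stop = min(len(sessions), i + reach + 1)
        history = []
        for j in range(start, stop):
            if j != i:  # the current session contributes its own features once
                history.extend(sessions[j])
        augmented.append(list(current) + history)
    return augmented


sessions = [["t1"], ["t2"], ["t3"], ["t4"], ["t5"], ["t6"]]
print(augment_with_context(sessions, window_size=3)[2])
```

Sessions near the start or end of the log simply receive a shorter history, which is one of several reasonable boundary choices.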

Session augmentation with identical hierarchy
In this section, we discuss the creation of the hierarchy and the further analysis that leads to identical hierarchies built from all the product view logs in the training data. First, we generate the hierarchy from the training data; Fig. 2 shows its architecture. In Fig. 2, the IDs starting with the letter "A" are the top categories or nodes of the hierarchy. The IDs starting with the letters "B" and "C" are the intermediate categories or nodes, and the IDs starting with the letter "D" are the leaf categories or nodes. The training data contain 22,440 hierarchies. The top nodes consist of 11 distinct categories, while the intermediate nodes starting with "B" and "C" consist of 91 and 441 categories, respectively. The leaf nodes comprise 36,122 different categories that represent the target products.

Mapping identical hierarchy
Once the hierarchy is generated, we enumerate the identical hierarchies from the intermediate nodes or categories. Here, the term identical hierarchy denotes a hierarchy constructed from parent and child categories that appears only in a certain gender category. From this definition, identical hierarchies can only be created from categories starting with "B" and "C" and not from "A" and "D," since the latter have no parent and no child categories, respectively. For instance, the product ID of a single session "A00003/B00008/C00026/D00070/" can be represented as in Fig. 3, where "A00003" is the parent category of "B00008," "B00008" is the parent category of "C00026," and "C00026" is the parent of "D00070." We first determine the identical categories: a category is identical if and only if it appears in only one gender category, "male" or "female." For top-, intermediate-, and leaf-level categories, the label weight is assigned a numeric value between 1 and 0 for overlapping and non-overlapping categories. If an intermediate-level category does not overlap between the gender categories, we treat it as an identical category and extract all possible parent and child lists from it; these are denoted as the identical hierarchy of that gender label. Figure 4 shows an example of session augmentation with identical hierarchy. Figure 4a represents the hierarchy "A00003/B00008/C00026/D00070/", in which "B00008" is an identical category according to the training data. Figure 4b shows the parent and child lists of the identical category "B00008" extracted from the training samples. Finally, Fig. 4c shows the child list of "B00008" embedded in the graph. The hierarchy in Fig. 4a can be represented as in Fig. 3. In contrast, after the child categories are embedded as in Fig. 4c, the augmented session can be represented as in Fig. 5.
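The mapping above can be sketched in Python under a few stated assumptions: a category is treated as identical when every training session that contains it carries the same label, only direct children are embedded, and the function names and the toy training sessions are hypothetical rather than taken from the dataset.

```python
from collections import defaultdict


def levels_of(path):
    """'A00003/B00008/C00026/D00070/' -> ['A00003', 'B00008', 'C00026', 'D00070']"""
    return [part for part in path.split("/") if part]


def find_identical_categories(sessions, labels):
    """Return {category: label} for B/C categories seen under one label only."""
    seen = defaultdict(set)
    for paths, label in zip(sessions, labels):
        for path in paths:
            for category in levels_of(path):
                if category[0] in ("B", "C"):  # intermediate levels only
                    seen[category].add(label)
    return {c: next(iter(ls)) for c, ls in seen.items() if len(ls) == 1}


def collect_children(sessions):
    """Map every category to the sorted list of its direct children."""
    children = defaultdict(set)
    for paths in sessions:
        for path in paths:
            levels = levels_of(path)
            for parent, child in zip(levels, levels[1:]):
                children[parent].add(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


def augment_with_identical_hierarchy(paths, identical, children):
    """Add the child list of each identical category to the session features."""
    features = []
    for path in paths:
        levels = levels_of(path)
        features.extend(levels)
        for category in levels:
            if category in identical:
                features.extend(
                    c for c in children.get(category, []) if c not in levels
                )
    return features


# Toy training data (hypothetical IDs apart from the example path in the text).
train = [
    ["A00003/B00008/C00026/D00070/"],
    ["A00003/B00008/C00027/D00071/"],
    ["A00003/B00009/C00028/D00072/"],
    ["A00003/B00009/C00028/D00073/"],
]
train_labels = ["female", "female", "male", "female"]
identical = find_identical_categories(train, train_labels)
children = collect_children(train)
print(augment_with_identical_hierarchy(train[0], identical, children))
```

In this toy example, "B00008" occurs only in "female" sessions, so its other child "C00027" is embedded into the session, in the spirit of the Fig. 4c description.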

Session to vector generation
In machine learning, term-to-vector generation [3,4] plays a very significant role in boosting system performance. After the session augmentation process, we assign a weight to each term (t_i) in a given session (s_j) using a binary weighting approach, in which the term weight is assigned a numeric value of 1 or 0 for overlapping and non-overlapping terms in a certain label (c_k), as denoted in Eq. 1. We then normalize the session using cosine normalization, as denoted in Eq. 2. In addition, we introduce a simple weighting approach based on term frequency (TF): we count the number of times each term appears in the session and normalize the count, as denoted in Eq. 3.
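The three weighting schemes can be sketched as follows. Since the equations are not reproduced in this text, the exact forms are assumptions: binary weight is taken as term presence in the session, and both normalized variants use the Euclidean (cosine) norm of the session vector. The function names are hypothetical.

```python
import math
from collections import Counter


def binary_weights(terms):
    """Weight 1 for every term present in the session; absent terms are 0."""
    return {term: 1.0 for term in set(terms)}


def _cosine_normalize(weights):
    # Divide each weight by the Euclidean length of the session vector.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


def normalized_binary_weights(terms):
    """Binary weights scaled to unit length (assumed form of the norm-binary scheme)."""
    return _cosine_normalize(binary_weights(terms))


def normalized_tf_weights(terms):
    """Raw term counts scaled to unit length (assumed form of the norm-TF scheme)."""
    return _cosine_normalize({t: float(c) for t, c in Counter(terms).items()})


session = ["A00001", "B00001", "A00001", "C00001"]
print(binary_weights(session))
print(normalized_binary_weights(session))
print(normalized_tf_weights(session))
```

With this choice, a term that occurs twice in a session receives a larger norm-TF weight than terms occurring once, whereas both binary variants treat all present terms equally.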

Classifier
Machine learning methods construct a classification model that predicts the category of new test documents by learning the statistics of the training data. Support vector machines (SVMs) have achieved great success in classification problems and are considered among the most robust and accurate of the well-known classification algorithms. To evaluate the effectiveness of the proposed ML-based gender prediction system, we use the liblinear package, a library for large linear classification. It supports L2-regularized classifiers (L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)) and L1-regularized classifiers (L2-loss linear SVM and LR). The parameter -c was set to 1.0, which is the default setting in this toolbox.
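To connect the weighted sessions to the classifier, the sketch below writes them in the sparse text format that liblinear reads (one line per instance: a label followed by ascending `index:value` pairs, with indices starting at 1). The helper names, the vocabulary construction, and the label encoding (+1 for "female", -1 for "male") are assumptions made for illustration.

```python
def build_vocabulary(weighted_sessions):
    """Assign a 1-based feature index to every term, in first-seen order."""
    vocab = {}
    for weights in weighted_sessions:
        for term in sorted(weights):
            vocab.setdefault(term, len(vocab) + 1)
    return vocab


def to_liblinear_line(label, weights, vocab):
    """Format one session as '<label> <index>:<value> ...' with ascending indices."""
    pairs = sorted(
        (vocab[term], value)
        for term, value in weights.items()
        if term in vocab and value != 0  # unseen terms and zero weights are skipped
    )
    return " ".join([str(label)] + [f"{index}:{value:g}" for index, value in pairs])


weighted = [
    {"A00001": 1.0, "B00001": 1.0},
    {"B00001": 0.5, "C00001": 0.5},
]
vocab = build_vocabulary(weighted)
lines = [
    to_liblinear_line(+1, weighted[0], vocab),  # assumed encoding: +1 = "female"
    to_liblinear_line(-1, weighted[1], vocab),  # assumed encoding: -1 = "male"
]
print("\n".join(lines))
```

A file of such lines can then be passed to liblinear's `train` program, with the cost set through its `-c` option.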

Results
In this section, we provide empirical evidence for the effectiveness of the proposed gender prediction system and compare the results obtained with the different session augmentation-based approaches on E-commerce data.

Experimental dataset
The data used in this experiment for customer gender prediction are provided by the financing and promoting technology (FPT) group, and the dataset is divided into separate training and test sets, trainingData.csv and testData.csv, respectively. The training set contains 15,000 records, each corresponding to a product viewing log. Of the 15,000 sessions, 11,703 are in the "female" category and 3,297 are in the "male" category, so the dataset is quite unbalanced. A single log is composed of four columns separated by commas. The first column is a session ID. The second and third columns correspond to the session start time and session end time, respectively. The last column contains a list of product IDs, with consecutive product IDs separated by semicolons. In addition, a trainingLabels.csv file contains the labels of the corresponding sessions. Each product ID can be decomposed into four different IDs separated by slashes. The IDs starting with the letter "A" are the most general categories, and those starting with "D" correspond to individual products. The IDs that start with "B" and "C" are associated with subcategories and sub-subcategories, respectively.
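The record layout described above can be parsed with a few lines of Python. This is a sketch: the function name `parse_log_line` is hypothetical, the timestamps in the sample record are invented placeholders (the text does not give their format, so they are kept as plain strings), and only the session ID "u10001" comes from the text.

```python
def parse_log_line(line):
    """Parse one 'session_id,start,end,product_ids' record into a dict."""
    # Split on the first three commas only; the last column holds the product list.
    session_id, start, end, products = line.strip().split(",", 3)
    product_paths = [p for p in products.split(";") if p]
    return {
        "session_id": session_id,
        "start": start,
        "end": end,
        # Each product ID becomes its list of category levels, from A down to D.
        "products": [[lvl for lvl in path.split("/") if lvl] for path in product_paths],
    }


# Hypothetical record: timestamps and the second product ID are made up.
sample = (
    "u10001,2014-11-14 00:02:14,2014-11-14 00:02:20,"
    "A00001/B00001/C00001/D00001/;A00001/B00002/C00002/D00002/"
)
record = parse_log_line(sample)
print(record["session_id"], len(record["products"]))
```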

Cross validation
To split the dataset, we adopt n-fold cross-validation with n = 3, 5, and 10. For instance, in three-fold cross-validation, we randomly split the data into three folds; in each turn, one fold is used for testing and the others are used for training. Cross-validation is an effective model validation technique for assessing how the results of a statistical analysis will generalize to an independent dataset.
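The splitting procedure can be sketched with the standard library alone. The function name `n_fold_splits` and the fixed seed are illustrative assumptions; the text does not say how the random split was seeded or whether it was stratified by gender, so this sketch uses a plain random shuffle.

```python
import random


def n_fold_splits(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each fold of a random split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups of (almost) equal size.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


for n in (3, 5, 10):
    train, test = next(n_fold_splits(15000, n))
    print(n, len(train), len(test))
```

For the 15,000 training sessions this gives 10,000/5,000, 12,000/3,000, and 13,500/1,500 training/test samples per turn for n = 3, 5, and 10.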

Performance measurement
The standard measures used to judge the performance of a gender prediction system are precision, recall, and the F1 measure [8,26]. These measures are defined from a contingency table of predictions for a target category C_k. The precision P(C_k), recall R(C_k), and F1 measure F1(C_k) are defined in Eqs. 4-6, respectively. Here, TP(C_k) is the set of test sessions correctly classified to the category C_k, FP(C_k) is the set of test sessions incorrectly classified to the category, FN(C_k) is the set of test sessions wrongly rejected, and TN(C_k) is the set of test sessions correctly rejected. To compute the average performance, we use the macro-average, the micro-average, and the overall accuracy. The macro-averages of precision (P^M), recall (R^M), and the F1 measure (F1^M) over the class space are computed as in Eqs. 7-9, respectively, and the micro-averages of precision (P^μ), recall (R^μ), and the F1 measure (F1^μ) are computed as in Eqs. 10-12, respectively.
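For reference, the standard forms of these measures, written to match the definitions of TP, FP, and FN above, are given below with m denoting the number of classes. The equation numbers follow the references in the text. Writing the macro and micro F1 as the harmonic mean of the corresponding averaged precision and recall is an assumption; averaging the per-class F1 values is the other common convention.

```latex
\begin{align*}
P(C_k) &= \frac{|TP(C_k)|}{|TP(C_k)| + |FP(C_k)|} \tag{4}\\
R(C_k) &= \frac{|TP(C_k)|}{|TP(C_k)| + |FN(C_k)|} \tag{5}\\
F_1(C_k) &= \frac{2\,P(C_k)\,R(C_k)}{P(C_k) + R(C_k)} \tag{6}\\
P^{M} &= \frac{1}{m}\sum_{k=1}^{m} P(C_k) \tag{7}\\
R^{M} &= \frac{1}{m}\sum_{k=1}^{m} R(C_k) \tag{8}\\
F_1^{M} &= \frac{2\,P^{M} R^{M}}{P^{M} + R^{M}} \tag{9}\\
P^{\mu} &= \frac{\sum_{k=1}^{m} |TP(C_k)|}{\sum_{k=1}^{m} \bigl(|TP(C_k)| + |FP(C_k)|\bigr)} \tag{10}\\
R^{\mu} &= \frac{\sum_{k=1}^{m} |TP(C_k)|}{\sum_{k=1}^{m} \bigl(|TP(C_k)| + |FN(C_k)|\bigr)} \tag{11}\\
F_1^{\mu} &= \frac{2\,P^{\mu} R^{\mu}}{P^{\mu} + R^{\mu}} \tag{12}
\end{align*}
```

Because the "female" class is much larger than the "male" class, the micro-averages are dominated by the "female" sessions, while the macro-averages weight both classes equally.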

System performances
In this section, we evaluate system performance for different weighting approaches and session augmentation approaches, including unique IDs decomposition, context window generation, and identical hierarchy, using n-fold cross-validation. Table 1 shows the effect of adding uni-gram and bi-gram-based unique IDs decomposition from each session, where the term weight is computed with the binary approach. Table 2 shows the performance when the context window (CW) size is set to CW = 3. The context-based results improve on the results in Table 1, which add only unique IDs, in terms of precision, recall, and F-score.

The effect of different term weighting approaches
Here, we show the performance of different weighting approaches under different cross-validation sizes. Tables 3, 4, and 5 show the performance of the binary weighting approach stated in Eq. 1 under the different cross-validation settings. Tables 6, 7, and 8 show the performance of the normalized binary weighting approach stated in Eq. 2 under the same cross-validation settings.

Performance based on normalized term frequency-based approach
Tables 9, 10, and 11 show the performance of the normalized term frequency-based (norm-TF) weighting approach stated in Eq. 3, where different cross-validation settings are taken into account to judge the performance on the dataset. Table 12 shows the performance when identical hierarchies are added to augment the session.

Discussions
All the results in the "System performances" section show that different combinations of session augmentation approaches, as well as different term weighting approaches, can significantly improve system performance. Using unique IDs decomposition (Table 1), we achieved F-scores of 96.81% and 88.89% in the female and male categories, respectively, and macro and micro F-scores of 92.65% and 95%. Table 2 shows an improvement when we apply the context window-based history generation approach: it achieves F-scores of 97.35% and 90.46% in the female and male categories, respectively, and macro and micro F-scores of 93.91% and 95.85%. Further improvement is seen when we apply different term weighting approaches. Table 12 shows a significant improvement using identical hierarchies: although the female and male sessions are quite unbalanced, identical hierarchies significantly improve the per-category performance. Figures 6, 7, and 8 show the performance of each fold in three-fold, five-fold, and ten-fold cross-validation, respectively, compared in terms of F-score. In Fig. 6, the 15,000 training samples are randomly divided into three folds, and in each turn 10,000 samples are used for training and 5,000 for testing. In Fig. 7, for five-fold cross-validation, 12,000 and 3,000 samples are used for training and testing, respectively, in each turn. Finally, in Fig. 8, for ten-fold cross-validation, the samples are randomly divided into ten folds, and in each iteration 13,500 samples are used for training and the remaining 1,500 as test data.
Figures 6, 7, and 8 show that the results are very consistent even in different cross-validation settings. They also show that, for each fold, the performance improves slightly from three-fold to five-fold and then to ten-fold, since five-fold and ten-fold have more training samples than three-fold.

Table 1 Performance measure with unique IDs decomposition using cross validation = 3
Table 2 Performance measure with context window using cross validation = 3
Table 4 Results with binary approach with cross validation = 5
Table 5 Results with binary approach with cross validation = 10
Table 6 Results with norm-binary approach with cross validation = 3
Table 8 Results with norm-TF with cross validation = 10
Table 9 Results with norm-TF with cross validation = 3
Table 10 Results with norm-TF with cross validation = 5
Table 11 Results with norm-TF with cross validation = 10
Table 12 Performance measure with identical hierarchy using cross validation = 3

Conclusions
In this work, we investigated the effectiveness of the proposed session augmentation approaches, including unique IDs decomposition, context window, and identical hierarchy, together with different term weighting approaches and an SVM classifier, for ML-based gender prediction. First, we implemented a unique IDs decomposition-based session augmentation approach to create good features and conducted our experiments with this implementation. We then implemented a context window-based session augmentation approach that creates a history from the surrounding sessions, and combined it with unique IDs decomposition to improve system performance. Finally, we introduced the extraction of identical hierarchies, which further improves system performance. In addition, the simple binary weighting approach also proves effective in the classification task. Therefore, the combination of all session augmentation approaches with an SVM classifier can significantly improve the performance of the gender prediction system.