Customer gender prediction system on hierarchical E-commerce data

Khan, Mohammad Masud; Sohrab, Mohammad Golam; Yousuf, Mohammad Abu

doi:10.1186/s43088-020-0035-7

Research
Open access
Published: 17 March 2020

Mohammad Masud Khan¹,
Mohammad Golam Sohrab² &
Mohammad Abu Yousuf¹

Beni-Suef University Journal of Basic and Applied Sciences volume 9, Article number: 10 (2020)

Abstract

Background

E-commerce services provide online shopping sites and mobile applications for small and medium sellers. To make buying and selling more efficient, a machine learning system can be applied to predict the organization and display of products that is most likely to bring useful information to the user and thereby facilitate online purchases. It is therefore important to understand which products are relevant to each gender. In this work, we present a statistical machine learning (ML)-based gender prediction system that predicts the gender “male” or “female” from transactional E-commerce data. We introduce several approaches, including unique IDs decomposition, context window-based history generation, and extraction of identical hierarchies from the training set, to address gender prediction as a classification problem on online transactional data.

Results

The experimental results show that different feature augmentation approaches, as well as different term or feature weighting approaches, can significantly enhance the performance of the statistical machine learning-based gender prediction system.

Conclusions

This work presents an ML-based implementation approach to E-commerce gender prediction. Different session augmentation approaches combined with a support vector machine (SVM) classifier can significantly improve the performance of the gender prediction system.

1 Background

This work presents research on implementing a machine learning (ML)-based automatic text classification (ATC) [1, 2, 3, 4] system for gender prediction on transactional E-commerce data. E-commerce services provide online shopping sites and mobile applications for small and medium sellers. To make buying and selling more efficient, a machine learning system can be applied to predict the organization and display of products that is most likely to bring useful information to the user and thereby facilitate online purchases. In this work, we address this problem by predicting the gender “male” or “female” from browsing history using statistical machine learning techniques. Several approaches are considered to enhance the gender prediction system. We introduce feature augmentation approaches, including unique IDs decomposition, context window-based history generation, and extraction of identical hierarchies from the training set, to address gender prediction as a classification problem on online transactional data. The experimental results show that different feature augmentation approaches, as well as different term or feature weighting approaches, can significantly enhance the performance of the statistical ML-based gender prediction system.

In E-commerce, business-to-business-to-customer (B2B2C) platforms run several services that provide online shopping sites and mobile applications for small and medium sellers. Transaction data, such as product browsing and purchasing activities from buyers and product portfolios from sellers, can be aggregated to provide more efficient buying and selling experiences. For example, statistical machine learning (SML) techniques can be applied to predict the organization and display of products that is most likely to bring useful information to the user and facilitate online purchases.

The primary motivation for exploiting session augmentation-based approaches is twofold. The first aim is to create more effective features from product viewing transaction sessions by augmenting the features of the sessions S = {s₁, s₂, ..., s_j}, which can enhance system performance. The second is to create simple term or feature weighting approaches that yield a more effective classifier and further boost system performance. The term or feature weight is the degree of importance of term or feature t_i in each transaction session s_j. Term weighting plays a very significant role in text classification (TC): an effective term or feature weighting approach identifies information-rich terms and assigns appropriate weighting values to them.

Most work on SML-based approaches relies on either document-indexing or class-indexing, which is tied to the document space [2] or the category space [3, 4]. In this dataset, the transactional sessions or log sessions belong to neither a document space nor a class space. Therefore, to handle such data in a statistical classification system, we use a binary weighting approach, a cosine-normalized binary approach that normalizes the features of each session, and a normalized term frequency-based approach.

In this work, each session in the transaction data is modeled as a feature vector extracted from the session's product IDs. Gender prediction is a binary classification problem in which a session is labeled as “male” or “female.” The trainable classifier is expected to learn patterns by identifying the feature values that are most correlated with the classes “male” or “female.” When a new session or log file is given to the system, the learned patterns are used to classify each session as either a “male” or a “female” session and to give it a score between 0 and 1. Our goal is to show that an effective gender prediction system can be built by relying solely on session augmentation and simple weighting approaches. Session augmentation with unique IDs decomposition alone produces a good classification score. Adding the context window approach to unique IDs decomposition boosts system performance, and adding identical hierarchies on top of both boosts it further. These approaches are therefore effective in improving the SML-based gender prediction system.

Recently, many experiments have been conducted using different term weighting approaches [1, 3, 4, 5, 6, 7] to address the classification task as a statistical method. Document-indexing-based approaches and approaches based on the four fundamental information elements [3, 4] are considered the most popular term weighting methods in ATC.

Many experiments have also used document- and class-indexing-based term weighting approaches to address the classification task as a statistical method [2, 7, 8, 9, 10, 11]. TF.IDF is the most popular term weighting method for the ATC task and for document-indexing [2]. Salton and Buckley [11] discussed many term weighting approaches in the information retrieval (IR) [12, 13, 14, 15, 16] field and found that normalized TF.IDF is the best document weighting function. TF.IDF is therefore considered one of the standard weighting approaches in statistical machine learning, especially in text classification. In contrast, Sohrab and Ren [3, 4] recently proposed class-indexing-based (TF.IDF.ICS_δF) term weighting. In class-indexing, a subset of documents from the global document space D = {d₁, d₂, ..., d_n} is allocated to a certain class c_k (k = 1, 2, ..., m) according to their topics in order to create a boundary line vector space in the training procedure. The class space is therefore defined as C = {(d₁₁, d₁₂, ..., d_1n) ∈ c₁, (d₂₁, d₂₂, ..., d_2n) ∈ c₂, ..., (d_m1, d_m2, ..., d_mn) ∈ c_m}, where a set of documents with the same topic is assigned to a certain class c_k. With the class-indexing-based term weighting approach, they reported better performance than the traditional weighting approaches.

In the last few years, researchers have attempted to improve TC performance by exploiting statistical classification approaches and machine learning techniques, including probabilistic Bayesian models [17], support vector machines (SVMs) [18, 19], decision trees (Lewis and Ringuette) [17], Rocchio classifiers [20], and multivariate regression models [21]. Among them, SVM-based classifiers have achieved great success in classification problems. We therefore adopt SVMs as the classifier for learning and predicting gender.

2 Methods

The proposed ML-based gender prediction system consists of three steps: (i) session augmentation to generate candidate features, (ii) session-to-vector generation, and (iii) classification.

2.1 Session augmentation approach

The session augmentation approach comprises three different representations: (i) session augmentation with unique IDs decomposition, (ii) session augmentation with context window, and (iii) session augmentation with identical hierarchy. The following subsections describe how these augmentations are produced.

2.1.1 Session augmentation with unique IDs decomposition

In statistical classification methods [13, 15, 22, 23, 24, 25], it is important to generate good features from text and then apply a term weighting approach to turn the text into a vector, which is the input to the classifier. A session, or single product viewing log, in the training dataset is composed of four columns and reads as follows: u10001, 2014-11-14 00:02:14, 2014-11-14 00:02:20, A00001/B00001/C00001/D00001/. Here, u10001 is the session ID, 2014-11-14 00:02:14 and 2014-11-14 00:02:20 are the session start and end times respectively, and A00001/B00001/C00001/D00001 is a list of product IDs separated by slashes.

When a single session contains multiple product views, the product lists are separated by semicolons: u10001, 2014-11-14 00:02:14, 2014-11-14 00:02:20, A00001/B00001/C00001/D00001/; A00002/B00002/C00002/D00002/.
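The log format described above can be read with a few lines of Python. This is a minimal sketch, not the authors' code: the `Session` type and `parse_session` function are hypothetical names, and the timestamp format is inferred from the example line.

```python
from datetime import datetime
from typing import List, NamedTuple


class Session(NamedTuple):
    session_id: str
    start: datetime
    end: datetime
    products: List[List[str]]  # one [A, B, C, D] path per viewed product


def parse_session(line: str) -> Session:
    """Parse one log line: session ID, start time, end time, product list."""
    # Split on the first three commas only; the remainder is the product list.
    session_id, start, end, product_field = [p.strip() for p in line.split(",", 3)]
    fmt = "%Y-%m-%d %H:%M:%S"  # assumed from the example timestamps
    products = [
        # Drop the empty piece left by the trailing slash of each product.
        [category for category in product.strip().split("/") if category]
        for product in product_field.split(";")
        if product.strip()
    ]
    return Session(session_id, datetime.strptime(start, fmt),
                   datetime.strptime(end, fmt), products)


line = ("u10001, 2014-11-14 00:02:14, 2014-11-14 00:02:20, "
        "A00001/B00001/C00001/D00001/; A00002/B00002/C00002/D00002/")
session = parse_session(line)
print(session.session_id, len(session.products), session.products[0])
```

Each product is kept as an ordered path from top category to leaf, which is the form the feature compositions below operate on.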

The product IDs in the dataset are decomposed into two kinds of features: uni-gram and bi-gram compositions.

Uni-gram-based feature composition

Given the product IDs of a single session, “A00001/B00001/C00001/D00001/,” we first generate four uni-gram features: “A00001,” “B00001,” “C00001,” and “D00001.” Since the task is binary classification into a male or female label, we add more features by merging each product ID with the label, giving “A00001-label,” “B00001-label,” “C00001-label,” and “D00001-label,” where “label” is the session's category, “female” or “male.”

Bi-gram-based feature composition

The product IDs follow a hierarchy from the top category “A” to the leaf category “D” through the intermediate categories “B” and “C.” The bi-gram feature composition therefore proceeds top-down. The bi-gram features from the single session above are “A00001-B00001,” “B00001-C00001,” and “C00001-D00001,” and the corresponding label-augmented features are “A00001-B00001-label,” “B00001-C00001-label,” and “C00001-D00001-label.” With the uni-gram and bi-gram approaches together, we can generate more features for the gender prediction system.
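The two compositions can be sketched in Python as follows. The function name `decompose` is hypothetical, and making the label optional is an assumption: the text does not say how label-merged features are handled for sessions whose label is unknown.

```python
from typing import List, Optional


def decompose(path: List[str], label: Optional[str] = None) -> List[str]:
    """Uni-gram and bi-gram features for one product path, top category first."""
    unigrams = list(path)
    # Bi-grams pair each category with its direct child, in top-down order.
    bigrams = [f"{parent}-{child}" for parent, child in zip(path, path[1:])]
    features = unigrams + bigrams
    if label is not None:
        # Label-merged copies; only possible where the session label is known.
        features += [f"{feature}-{label}" for feature in unigrams + bigrams]
    return features


features = decompose(["A00001", "B00001", "C00001", "D00001"], label="female")
print(features)
```

For the four-level path in the example this yields four uni-grams and three bi-grams, plus the same seven features with the label appended.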

2.1.2 Session augmentation with context window

It is important to analyze how similar the current session is to its surrounding sessions. To capture this contextual information, we create a history based on a window size, and the context window-based approach augments the current session with it. A session space S = {s₁, s₂, s₃, ..., s_m} is the set of sessions from the training dataset. In the context window approach, we set a window size around a given session to generate a history from the surrounding sessions. For instance, if we set window size = 3, the history is created from the current position, its two previous sessions, and its two next sessions. Figure 1 shows an example of setting a context window over the sessions. In this figure, window size = 3 is set on session 3 to generate a history that incorporates the two previous sessions and the two next sessions. In Fig. 1, the terms t_i = t₁, t₂, ..., t_n represent the corresponding session IDs of a single session. We build the history of a given session if it satisfies the following conditions,

Algorithm 1 shows the session augmentation based on the context window used to create the history of a given session. In this algorithm, the current session's features are augmented with the features of the previous and next sessions according to the window size.
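As a rough illustration of this kind of augmentation (not a reproduction of Algorithm 1), the Python sketch below appends the features of neighbouring sessions to the current one. It assumes, following the window size = 3 example, that a window of size w reaches w - 1 sessions on each side, and that the window is simply clipped at the start and end of the session list. The function name and the toy feature names are hypothetical.

```python
from typing import List


def augment_with_context(sessions: List[List[str]], index: int,
                         window_size: int) -> List[str]:
    """Current session's features followed by those of its neighbours."""
    reach = window_size - 1  # assumption: window size 3 -> two sessions per side
    start = max(0, index - reach)                 # clip at the first session
    stop = min(len(sessions), index + reach + 1)  # clip at the last session
    history = list(sessions[index])
    for j in range(start, stop):
        if j != index:
            history.extend(sessions[j])
    return history


# Toy sessions with one feature each, so the window is easy to read off.
sessions = [["t1"], ["t2"], ["t3"], ["t4"], ["t5"], ["t6"]]
print(augment_with_context(sessions, index=2, window_size=3))
```

Keeping the current session's own features first makes it easy to tell them apart from the history when inspecting the output.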

2.1.3 Session augmentation with identical hierarchy

In this section, we discuss the creation of the hierarchy and the further analysis that leads to identical hierarchies extracted from all the product viewing logs of the training data. First, we generate the hierarchy from the training data; Fig. 2 shows its architecture. In Fig. 2, the IDs starting with the letter “A” are the top categories or nodes of the hierarchy. The IDs starting with the letters “B” and “C” are the intermediate categories or nodes, and the IDs starting with the letter “D” are the leaf categories or nodes. The training data contains 22,440 hierarchies. The top nodes consist of 11 distinct categories, while the intermediate nodes starting with “B” and “C” consist of 91 and 441 categories respectively. The leaf nodes comprise 36,122 different categories that represent the target products.
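One way to build such a hierarchy from product paths and to count the distinct categories per level is sketched below in Python. The function names are hypothetical and the three paths are toy data, so the counts are not those of the real training set.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def build_hierarchy(paths: Iterable[List[str]]) -> Dict[str, Set[str]]:
    """Map every category ID to the set of its direct children."""
    children: Dict[str, Set[str]] = {}
    for path in paths:
        for parent, child in zip(path, path[1:]):
            children.setdefault(parent, set()).add(child)
        children.setdefault(path[-1], set())  # leaves have no children
    return children


def count_by_level(hierarchy: Dict[str, Set[str]]) -> Dict[str, int]:
    """Number of distinct categories per level, keyed by the leading letter."""
    return dict(Counter(node[0] for node in hierarchy))


paths = [
    ["A00001", "B00001", "C00001", "D00001"],
    ["A00001", "B00001", "C00002", "D00002"],
    ["A00003", "B00008", "C00026", "D00070"],
]
hierarchy = build_hierarchy(paths)
print(count_by_level(hierarchy))
print(sorted(hierarchy["B00001"]))
```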

Mapping identical hierarchy

Once the hierarchy has been generated, we enumerate the identical hierarchies from the intermediate nodes or categories. Here, an identical hierarchy is a hierarchy that is constructed from parent and child categories and appears in only one gender category. By this definition, identical hierarchies can only be created from categories starting with “B” and “C” and not from “A” and “D,” since these have no parent and no child categories respectively. For instance, the product IDs of a single session, “A00003/B00008/C00026/D00070/,” can be represented as in Fig. 3, where “A00003” is the parent category of “B00008,” “B00008” is the parent category of “C00026,” and “C00026” is the parent of “D00070.” We first determine the identical categories: a category is identical if and only if its product ID appears in only one gender category, “male” or “female.” For each top-, intermediate-, and leaf-level category, the label weight is assigned a value of 1 or 0 for overlapping and non-overlapping categories. If an intermediate-level category does not overlap between the gender categories, we treat it as an identical category and extract all possible parent and child lists from it; these are denoted the identical hierarchy of that gender label. Figure 4 shows an example of session augmentation with identical hierarchy. Figure 4a represents the hierarchy “A00003/B00008/C00026/D00070/,” where “B00008” is an identical category based on the training data. Figure 4b shows the parent and child lists of the identical category “B00008,” which are extracted from the training samples. Finally, Fig. 4c shows the child list of “B00008” embedded in the graph. The hierarchy in Fig. 4a can be represented as stated in Fig. 3. In contrast, once the child categories are embedded as in Fig. 4c, the augmented session can be represented as in Fig. 5.
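The idea can be sketched in Python under one reading of the definition: an intermediate category is identical when every training path containing it carries the same gender label, and a session is augmented with the children observed under its identical categories. All function names are hypothetical, and apart from “A00003/B00008/C00026/D00070” the IDs and labels below are made-up toy data.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

LabeledPath = Tuple[List[str], str]  # (product path, session label)


def identical_categories(data: List[LabeledPath]) -> Dict[str, str]:
    """Intermediate categories (B/C) seen under exactly one gender label."""
    labels_seen: Dict[str, Set[str]] = defaultdict(set)
    for path, label in data:
        for category in path:
            if category[0] in ("B", "C"):
                labels_seen[category].add(label)
    return {category: next(iter(labels))
            for category, labels in labels_seen.items() if len(labels) == 1}


def child_list(data: List[LabeledPath], category: str) -> List[str]:
    """Direct children observed under a category in the training paths."""
    found: Set[str] = set()
    for path, _ in data:
        for parent, child in zip(path, path[1:]):
            if parent == category:
                found.add(child)
    return sorted(found)


def augment(path: List[str], data: List[LabeledPath],
            identical: Dict[str, str]) -> List[str]:
    """Append children of identical categories that the path lacks."""
    extra: List[str] = []
    for category in path:
        if category in identical:
            extra.extend(c for c in child_list(data, category)
                         if c not in path and c not in extra)
    return path + extra


train = [
    (["A00003", "B00008", "C00026", "D00070"], "female"),
    (["A00003", "B00008", "C00027", "D00071"], "female"),
    (["A00003", "B00009", "C00030", "D00080"], "male"),
    (["A00003", "B00009", "C00030", "D00081"], "female"),
]
identical = identical_categories(train)
print(identical)
print(augment(["A00003", "B00008", "C00026", "D00070"], train, identical))
```

In this toy data “B00009” and “C00030” occur under both labels and are therefore not identical, while “B00008” contributes its other child to the augmented session.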

2.2 Session to vector generation

In the machine learning workbench, term-to-vector generation [3, 4] plays a very significant role in boosting system performance. After the session augmentation process, we assign a weight to each term (t_i) in a given session (s_j) using a binary weighting approach, in which the term weight takes the value 1 or 0 for overlapping and non-overlapping terms in a given label (c_k). It can be denoted as

$$ W\left({t}_i\right)=\begin{cases}1, & \text{if term } {t}_i \text{ appears in a certain category, } {t}_i\in {c}_k\\ 0, & \text{otherwise}\end{cases} $$

(1)

We then normalize each session vector using cosine normalization, denoted as

$$ {W}^{\mathrm{norm}}\left({t}_i\right)=\frac{W\left({t}_i\right)}{\sqrt{\sum_{t_i\ \varepsilon\ {s}_j}{\left[W\left({t}_i\right)\right]}^2}}, $$

(2)

In addition, we introduce a simple weighting approach based on term frequency (TF). We count how many times each term appears in the session, TF(t_i), and normalize it as

$$ {W}^{TF}\left({t}_i\right)=\frac{TF\left({t}_i\right)}{TF\left({t}_i\right)+1}, $$

(3)
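The three weighting schemes of Eqs. 1–3 can be sketched in Python as follows. Eq. 1 is read here as a presence indicator over the terms of a session, which is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def binary_weights(terms: List[str]) -> Dict[str, float]:
    """Eq. 1, read as a presence indicator: weight 1 for each term present."""
    return {term: 1.0 for term in terms}


def cosine_normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Eq. 2: divide every weight by the Euclidean norm of the session vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else dict(weights)


def norm_tf_weights(terms: List[str]) -> Dict[str, float]:
    """Eq. 3: TF / (TF + 1), which grows with the count but stays below 1."""
    return {term: tf / (tf + 1) for term, tf in Counter(terms).items()}


terms = ["A00001", "B00001", "A00001", "C00001"]
print(cosine_normalize(binary_weights(terms)))
print(norm_tf_weights(terms))
```

With three distinct terms, each cosine-normalized binary weight is 1/√3; under Eq. 3 a term seen twice gets 2/3 and a term seen once gets 1/2.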

2.3 Classifier

A machine learning method constructs a classification model that predicts the category of new test documents by learning the statistics of the training data. In the machine learning workbench, support vector machines (SVMs) have achieved great success in classification problems and are considered among the most robust and accurate of the well-known classification algorithms. To evaluate the effectiveness of the proposed ML-based gender prediction system, we use the liblinear package (see Note 1), a library for large linear classification. It supports L2-regularized classifiers, including L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR). The parameter -c was set to 1.0, which is the default setting in this toolbox.
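liblinear reads training data as sparse text lines of the form `label index:value ...`, with feature indices starting at 1 and listed in ascending order. The Python sketch below writes weighted sessions in that format. The helper names and the mapping of “female” to +1 and “male” to -1 are assumptions made for illustration.

```python
from typing import Dict, List


def build_vocabulary(sessions: List[Dict[str, float]]) -> Dict[str, int]:
    """Assign each distinct term a feature index, starting at 1."""
    vocabulary: Dict[str, int] = {}
    for weights in sessions:
        for term in weights:
            vocabulary.setdefault(term, len(vocabulary) + 1)
    return vocabulary


def to_liblinear_line(label: int, weights: Dict[str, float],
                      vocabulary: Dict[str, int]) -> str:
    """One sparse line: signed label, then index:value pairs by ascending index."""
    pairs = sorted((vocabulary[term], weight)
                   for term, weight in weights.items()
                   if term in vocabulary and weight != 0)
    return " ".join([f"{label:+d}"] + [f"{index}:{weight:g}" for index, weight in pairs])


sessions = [
    {"A00001": 1.0, "B00001": 1.0},
    {"A00002": 1.0, "B00001": 0.5},
]
labels = [+1, -1]  # assumed mapping: +1 = female, -1 = male
vocabulary = build_vocabulary(sessions)
lines = [to_liblinear_line(y, w, vocabulary) for y, w in zip(labels, sessions)]
print("\n".join(lines))
```

A file of such lines can be passed to liblinear's `train` program, where `-c 1` corresponds to the cost setting mentioned above; the solver type (`-s`) is not stated in the text.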

3 Results

In this section, we provide empirical evidence for the effectiveness of the proposed gender prediction system on E-commerce data, comparing the results obtained with the different session augmentation-based approaches.

3.1 Experimental dataset

The data for this customer gender prediction experiment was provided by the financing and promoting technology (FPT) group (see Note 2). The dataset (see Note 3) is divided into separate training and test sets, trainingData.csv and testData.csv, respectively. The training set contains 15,000 records, which correspond to product viewing logs. Of these 15,000 sessions, 11,703 are in the “female” category and 3297 are in the “male” category, so the dataset is quite unbalanced. A single log is composed of four columns separated by commas. The first column is a session ID. The second and third columns correspond to the session start time and session end time, respectively. The last column contains a list of product IDs, with consecutive product IDs separated by semicolons. In addition, a trainingLabels.csv file contains the labels of the corresponding sessions. Each product ID can be decomposed into four different IDs, which are separated by slashes. The IDs starting with the letter “A” are the most general categories and those starting with “D” correspond to individual products. The IDs that start with “B” and “C” are associated with subcategories and sub-subcategories, respectively.

3.2 Cross validation

To split the dataset, we adopt n-fold cross-validation with n = 3, 5, 10. For instance, in three-fold cross-validation we randomly split the data into three folds; in each turn, one fold is used for testing and the others for training. Cross-validation is an effective model validation technique for assessing how the results of a statistical analysis will generalize to an independent dataset.
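A random n-fold split of this kind can be sketched in Python as follows. The function name and the fixed seed are illustrative choices, and the split is not stratified by label because the text does not mention stratification.

```python
import random
from typing import Iterator, List, Tuple


def n_fold_splits(n_samples: int, n_folds: int,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of sessions to folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_indices = folds[k]
        train_indices = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_indices, test_indices


for n in (3, 5, 10):
    train_indices, test_indices = next(n_fold_splits(15000, n))
    print(n, len(train_indices), len(test_indices))
```

With 15,000 sessions this gives 10,000/5000, 12,000/3000, and 13,500/1500 training/test sessions per turn for three-, five-, and ten-fold cross-validation.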

3.3 Performance measurement

The standard measures used to judge the performance of gender prediction are precision, recall, and the F1 measure [8, 26]. These measures are defined from a contingency table of predictions for a target category C_k. The precision P(C_k), recall R(C_k), and F1 measure F₁(C_k) are defined in Eqs. 4–6 respectively:

$$ P\left({C}_k\right)=\frac{TP\left({C}_k\right)}{TP\left({C}_k\right)+ FP\left({C}_k\right)} $$

(4)

$$ R\left({C}_k\right)=\frac{TP\left({C}_k\right)}{TP\left({C}_k\right)+ FN\left({C}_k\right)} $$

(5)

$$ {F}_1\left({C}_k\right)=\frac{2\cdot P\left({C}_k\right)\cdot R\left({C}_k\right)}{P\left({C}_k\right)+R\left({C}_k\right)}=\frac{2\, TP\left({C}_k\right)}{2\, TP\left({C}_k\right)+ FP\left({C}_k\right)+ FN\left({C}_k\right)} $$

(6)
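Eqs. 4–6 translate directly into code. The Python sketch below uses hypothetical function names and toy predictions, and returns 0.0 where a denominator would be zero, which is a convention chosen here rather than something stated in the text.

```python
from typing import List, Tuple


def confusion_counts(gold: List[str], predicted: List[str],
                     category: str) -> Tuple[int, int, int]:
    """TP, FP and FN for one target category."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == category and p == category for g, p in pairs)
    fp = sum(g != category and p == category for g, p in pairs)
    fn = sum(g == category and p != category for g, p in pairs)
    return tp, fp, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Eqs. 4-6, with 0.0 returned for an empty denominator."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Toy example: 6 female and 4 male sessions, one error in each direction.
gold = ["female"] * 6 + ["male"] * 4
predicted = ["female"] * 5 + ["male"] + ["male"] * 3 + ["female"]
for category in ("female", "male"):
    print(category, precision_recall_f1(*confusion_counts(gold, predicted, category)))
```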

TP(C_k) is the number of test sessions correctly classified to the category C_k, FP(C_k) is the number of test sessions incorrectly classified to the category, FN(C_k) is the number of test sessions wrongly rejected, and TN(C_k) is the number of test sessions correctly rejected. To compute the average performance, we used the macro-average, the micro-average, and the overall accuracy. The macro-averages of precision (P^M), recall (R^M), and the F₁ measure ($ {F}_1^M $) over the class space are computed as in Eqs. 7–9 respectively:

$$ {P}^M=\frac{1}{m}{\sum}_{k=1}^mP\left({C}_k\right) $$

(7)

$$ {R}^M=\frac{1}{m}{\sum}_{k=1}^mR\left({C}_k\right) $$

(8)

$$ {F}_1^M=\frac{1}{m}{\sum}_{k=1}^m{F}_1\left({C}_k\right) $$

(9)

Similarly, the micro-averages of precision (P^μ), recall (R^μ), and the F₁ measure ($ {F}_1^{\mu } $) over the class space are computed as in Eqs. 10–12 respectively:

$$ {P}^{\mu }=\frac{\sum_{k=1}^m TP\left({C}_k\right)}{\sum_{k=1}^m\left( TP\left({C}_k\right)+ FP\left({C}_k\right)\right)} $$

(10)

$$ {R}^{\mu }=\frac{\sum_{k=1}^m TP\left({C}_k\right)}{\sum_{k=1}^m\left( TP\left({C}_k\right)+ FN\left({C}_k\right)\right)} $$

(11)

$$ {F}_1^{\mu }=\frac{2\cdot {P}^{\mu }\cdot {R}^{\mu }}{P^{\mu }+{R}^{\mu }} $$

(12)
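The difference between the two averages is easiest to see on a small example. The Python sketch below implements Eqs. 7–12 from per-category (TP, FP, FN) counts; the function names and the toy counts are hypothetical.

```python
from typing import Dict, Tuple

Counts = Dict[str, Tuple[int, int, int]]  # category -> (TP, FP, FN)


def ratio(numerator: float, denominator: float) -> float:
    """Division that returns 0.0 instead of failing on a zero denominator."""
    return numerator / denominator if denominator else 0.0


def macro_average(counts: Counts) -> Tuple[float, float, float]:
    """Eqs. 7-9: average the per-category scores, each category weighted equally."""
    m = len(counts)
    precision = sum(ratio(tp, tp + fp) for tp, fp, fn in counts.values()) / m
    recall = sum(ratio(tp, tp + fn) for tp, fp, fn in counts.values()) / m
    f1 = sum(ratio(2 * tp, 2 * tp + fp + fn) for tp, fp, fn in counts.values()) / m
    return precision, recall, f1


def micro_average(counts: Counts) -> Tuple[float, float, float]:
    """Eqs. 10-12: pool the counts over categories before taking ratios."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return precision, recall, ratio(2 * precision * recall, precision + recall)


counts = {"female": (5, 1, 1), "male": (3, 1, 1)}
print(macro_average(counts))
print(micro_average(counts))
```

When every session receives exactly one of the labels, each misclassification is a false positive for one category and a false negative for the other, so the pooled FP and FN totals are equal and the micro-averaged precision, recall, and F₁ all coincide with the overall accuracy. The macro-average instead gives the smaller category the same influence as the larger one.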

3.4 System performances

In this section, we judge the system performances on different weighting approaches as well as session augmentation approaches, including unique IDs decomposition, context window generation, and identical hierarchy with n-fold cross-validation.

3.4.1 The effect of unique IDs decomposition

Table 1 shows the effect of adding uni-gram- and bi-gram-based unique IDs decomposition to each session, where the term weight is computed with the binary approach.

Table 1 Performance measure with unique IDs decomposition using cross validation = 3

3.4.2 The effect of context window to create history

Table 2 shows the performance when the context window (CW) size is set to CW = 3. The context-based results show an improvement in precision, recall, and F-score over the results in Table 1, which are based only on adding unique IDs.

Table 2 Performance measure with context window using cross validation = 3

3.4.3 The effect of different term weighting approaches

Here, we show the performances of different weighting approaches with different cross-validation size.

Performance based on binary weighting approach

Tables 3, 4, and 5 show the performance of the binary weighting approach stated in Eq. 1, where different cross-validation settings are used to judge the performance on the dataset.

Table 3 Results with binary approach with cross validation = 3

Table 4 Results with binary approach with cross validation = 5

Table 5 Results with binary approach with cross validation = 10

Performance based on normalized binary weighting approach

Tables 6, 7, and 8 show the performance of the normalized binary weighting approach stated in Eq. 2, where different cross-validation settings are used to judge the performance on the dataset.

Table 6 Results with norm-binary approach with cross validation = 3

Table 7 Results with norm-binary approach with cross validation = 5

Table 8 Results with norm-binary approach with cross validation = 10

Performance based on normalized term frequency-based approach

Tables 9, 10, and 11 show the performance of the normalized term frequency-based (norm-TF) weighting approach stated in Eq. 3, where different cross-validation settings are used to judge the performance on the dataset.

Table 9 Results with norm-TF with cross validation = 3

Table 10 Results with norm-TF with cross validation = 5

Table 11 Results with norm-TF with cross validation = 10

3.4.4 The effect of identical hierarchy

Table 12 shows performance when identical hierarchies are added to augment the session.

Table 12 Performance measure with identical hierarchy using cross validation = 3

3.5 Discussions

All the results in Section 3.4 show that different combinations of session augmentation approaches, as well as different term weighting approaches, can significantly improve system performance. Using unique IDs decomposition (Table 1), we achieved F-scores of 96.81% and 88.89% in the female and male categories respectively, and 92.65% and 95% in terms of macro and micro F-score. Table 2 shows an improvement when we apply the context window-based history generation approach: F-scores of 97.35% and 90.46% in the female and male categories respectively, and 93.91% and 95.85% in terms of macro and micro F-score. Further improvement is seen when we apply different term weighting approaches. Table 12 shows a significant improvement using identical hierarchy. Although the female and male session categories are quite unbalanced, identical hierarchy significantly improves the per-category performance.

Figures 6, 7, and 8 show the performance of each fold in three-fold, five-fold, and ten-fold cross-validation respectively, compared on the basis of F-score. In Fig. 6, the 15,000 training samples are randomly divided into three folds, and in each turn 10,000 are used for training and 5000 for testing. For five-fold cross-validation in Fig. 7, 12,000 and 3000 samples are used for training and testing respectively in each turn. Finally, for ten-fold cross-validation in Fig. 8, the samples are randomly divided into ten folds, and in each iteration 13,500 are used for training and the remaining 1500 as test data. Figures 6, 7, and 8 show that the results are very consistent across the different cross-validation settings. They also show that performance improves slightly from three-fold to five-fold and then to ten-fold, since five-fold and ten-fold cross-validation use more training samples than three-fold.

4 Conclusions

In this implementation, we investigated the effectiveness of the proposed session augmentation approaches, namely unique IDs decomposition, context window, and identical hierarchy, together with different term weighting approaches and an SVM classifier, for ML-based gender prediction. First, we implemented a unique IDs decomposition-based session augmentation approach to create good features and conducted our experiments with it. We then implemented a context window-based session augmentation approach to create a history from surrounding sessions, and combining it with unique IDs decomposition improved system performance. Finally, we introduced the extraction of identical hierarchies, which further improved system performance. In addition, the simple binary weighting approach also proved effective in the classification task. Therefore, the combination of all the session augmentation approaches with an SVM classifier can significantly improve the performance of the gender prediction system.

Availability of data and materials

The dataset supporting the conclusions of this article is available from the public data repository at https://knowledgepit.ml/pakdd15-data-mining-competition/. The dataset is also available from the corresponding author upon request.

Notes

1. Available at https://www.csie.ntu.edu.tw/~cjlin/liblinear/
2. Available at https://www.fpt.com.vn/en/
3. Available at https://knowledgepit.ml/pakdd15-data-mining-competition/

Abbreviations

ATC: Automatic text classification
B2B2C: Business-to-business-to-customer
CW: Context window
FPT: Financing and promoting technology
IR: Information retrieval
LR: Logistic regression
ML: Machine learning
SML: Statistical machine learning
SVM: Support vector machine
TC: Text classification
TF.IDF.ICS_δF: Term frequency inverse document frequency incorporated with inverse class space density frequency
TF.IDF: Term frequency inverse document frequency
VSM: Vector space model
WS: Window size

References

Debole F, Sebastiani F (2003) Supervised term weighting for automated text categorization. In Proceedings of the 2003 ACM symposium on applied computing, pp 784–788
Salton G, McGill MJ (1983) Introduction to modern information retrieval
Google Scholar
Sohrab MG, Ren F (2012) Class-indexing: the effectiveness of class space density in high and low-dimensional vector space for text classification. In: 2^nd International Conference of IEEE CCIS, pp 2034–2042
Google Scholar
Sohrab MG, Ren F (2013) Class-indexing-based term weighting for automatic text classification. Inform Sci 236:109–125
Article Google Scholar
Flora S, Agus T (2011) Experiments in term weighting for novelty mining. Expert Syst Appl 38(11):14094–14101
Google Scholar
Kansheng S, Jie H, Hai-tao L, Nai-tong Z, Wen-tao S (2011) Efficient text classification method based on improved term reduction and term weighting. J China Univ Posts Telecomm 18(1):131–135
Google Scholar
Ko Y, Seo J (2009) Text classification from unlabeled documents with bootstrapping and feature projection techniques. Inform Processing Manag 45(1):70–83
Article Google Scholar
Fuhr N, Buckley C (1991) A probabilistic learning approach for document indexing. In: Information Sciences, vol 9, pp 223–248
Google Scholar
Liu Y, Loh H, Sun A (2009) Imbalanced text classification: a term weighting approach. Expert Systems with Applications 36:690–701
Article Google Scholar
Salton G (1975) A theory of indexing
Book Google Scholar
Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing, by. Assoc Comp Mach 18:613–620
Google Scholar
Joachims T (2001) A statistical learning model of text classification for support vector machines. In: SIGIR-2001: Proceedings of the 24th annual international ACM SIGIR conference on Research and Development in Information Retrieval, pp 128–136
Chapter Google Scholar
Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval. J Document 28:11–21
Article Google Scholar
Korfhage RR (1997) Information storage and retrieval. Wiley
Luo Q, Chen E, Xiong H (2011) A semantic term weighting scheme for text classification. Exp Syst Appl 38(10):12708–12716
Article Google Scholar
Singhal A (2001) Modern information retrieval: a brief overview, by. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24:35–43
Google Scholar
Lewis DD, Ringuette M (1994) A Comparison of two learning algorithms for text categorization. In: Proceedings of the third annual symposium on document analysis and information retrieval, pp 81–93
Google Scholar
Godbole S, Sarawagi S, Chakrabarti S (2002) Scaling multi-class support vector machine using inter-class confusion. In: In Proceedings of the 8th ACM international conference on knowledge discovery and data mining, pp 513–518
Google Scholar
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of 10th European conference on machine learning. Springer Verlag, Heidelberg, pp 137–142
Google Scholar
Lewis DD et al (1996) Training algorithms for linear text classifiers. In: In Proceedings of 19th ACM International Conference on Research and Development in Information Retrieval
Google Scholar
Schütze H, Hull D, Pedersen JO (1995) A comparison of classifiers and document representations for the routing problem. In: Proceedings of the 18th ACM International Conference on Research and Development in Information Retrieval, pp 613–620
Olson DL, Delen D (2008) Advanced data mining techniques. Springer
James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning: with applications in R. Springer
Han J, Kamber M, Pei J (2012) Data mining: concepts and techniques, 3rd edn. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann
Wu X, Kumar V et al (2008) Top 10 algorithms in data mining. Knowledge Inform Syst 14:1–37
Yang Y (1999) An evaluation of statistical approaches to text categorization. J Inform Retr 1(1/2):67–88

Acknowledgments

We thank the anonymous reviewers for their valuable comments.

Funding

This research was carried out with funding from AIRC/AIST and is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO). The corresponding author was funded by the NEDO project for the design, implementation, and analysis of the work and for writing the manuscript.

Author information

Mohammad Masud Khan and Mohammad Golam Sohrab contributed equally to this work.

Authors and Affiliations

Institute of Information Technology, Jahangirnagar University, Dhaka, Bangladesh
Mohammad Masud Khan & Mohammad Abu Yousuf
National Institute of Advanced Industrial Science and Technology, 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
Mohammad Golam Sohrab

Authors

Mohammad Masud Khan
Mohammad Golam Sohrab
Mohammad Abu Yousuf

Contributions

MGS designed and implemented the set of approaches and algorithms, carried out the experiments and analysis for the machine learning-based gender prediction system, and drafted the manuscript. MMK carried out experiments to reproduce and analyze the results using cross-validation. MAY participated in research coordination. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mohammad Golam Sohrab.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Khan, M.M., Sohrab, M.G. & Yousuf, M.A. Customer gender prediction system on hierarchical E-commerce data. Beni-Suef Univ J Basic Appl Sci 9, 10 (2020). https://doi.org/10.1186/s43088-020-0035-7

Received: 04 July 2019
Accepted: 09 January 2020
Published: 17 March 2020
DOI: https://doi.org/10.1186/s43088-020-0035-7
