UBC Graduate Research

Predicting Job Salaries from Text Descriptions Jackman, Shaun; Reid, Graham 2013-04-16

Predicting Job Salaries from Text Descriptions

Shaun Jackman
sjackman@gmail.com
University of British Columbia

Graham Reid
graham.d.reid@gmail.com
University of British Columbia

Abstract

An online job listing web site has extensive data that is primarily unstructured text descriptions of the posted jobs. Many listings provide a salary, but as many as half do not. For those listings that do not provide a salary, it is useful to predict a salary based on the description of that job. We tested a variety of regression methods, including maximum-likelihood regression, lasso regression, artificial neural networks and random forests. We optimized the parameters of each of these methods, validated the performance of each model using cross validation and compared the performance of these methods on a withheld test data set.

1 Background

The data set is composed of 244,768 classified advertisements for jobs in the United Kingdom from the classified advertisement search web site, Adzuna. Each advertisement gives the title of the job, a plain text description of the job, the location, contract type (full-time or part-time), contract duration (permanent or contract), the name of the company, a category (such as customer service) and the annual salary. The majority of the data is found in the unstructured text description of the job. The challenge is to predict the salary of a job from its text description and other structured data. This data set and problem were obtained from a machine learning competition hosted by Kaggle [1], a web site dedicated to hosting machine learning competitions.

2 Methods

2.1 Data model

The data is composed of a job title and description, which is unstructured plain text, as well as additional structured data: the location, contract type and duration, company and category. The structured data were treated as categorical variables.
The unstructured text features were modelled using a binary-feature bag-of-words model.

The feature hashing trick [2] was used to limit the memory requirements to a fixed size. For each word in the description, the hash value of that word is calculated using MurmurHash3 [3] and the corresponding bit of the boolean feature vector is set to one. The size of the feature vector is thus fixed and does not depend on the variety of words present in the description. Some hash collisions are expected, but should be relatively rare with a sufficiently large feature vector. We used a feature vector of size $2^{24}$ bits, or 16 Mb.

We used a log-transform of the output variable, the salary. Without this transform, for a linear model each word of the description would contribute a fixed amount to the salary, such as £5,000 for the word "manager", or −£10,000 for the word "intern". This model does not make intuitive sense and could even result in predicting negative salaries. The effect of the log transform is that each word of the description instead contributes a multiplicative factor to the salary. For example, the word "manager" may cause the salary to be multiplied by 1.25, whereas the word "intern" may cause the salary to be multiplied by a factor smaller than one.

2.2 Comparison of regression methods

We trained and tested a variety of regression methods: maximum-likelihood regression, lasso regression, artificial neural networks, dropout neural networks and random forests. We used cross validation to optimize the parameters of the models. We then compared the performance of each optimized model by comparing the mean absolute error of each model's predictions on a withheld data set, which was not used for training the models. The software package Vowpal Wabbit 7.2.0 [4] was used for all of these regression methods except the random forest, for which we used the scikit-learn 0.13.1 RandomForestRegressor [5].

2.3 Maximum-likelihood regression

We used maximum-likelihood regression to predict salaries.
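As a rough illustration of what this regressor consumes and produces, the sketch below combines the hashed binary bag-of-words features and the log-transformed salary of Section 2.1. It is a minimal sketch under stated assumptions, not the Vowpal Wabbit implementation: `zlib.crc32` stands in for MurmurHash3, the tokenizer is a simple regular expression, and the base salary and weights are hypothetical values chosen to reproduce the "manager" factor of 1.25 mentioned above.

```python
import math
import re
import zlib

NUM_BITS = 24  # the paper reports a feature vector of 2^24 bits


def hashed_bag_of_words(text, num_bits=NUM_BITS):
    """Return the indices of the bits set in a binary hashed bag-of-words vector."""
    size = 1 << num_bits
    words = re.findall(r"[a-z0-9]+", text.lower())  # assumed tokenizer
    # zlib.crc32 is a stand-in for MurmurHash3, which is not in the standard library.
    return {zlib.crc32(word.encode("utf-8")) % size for word in words}


def predict_salary(active_bits, weights, bias):
    """Linear model on log-salary, so each active word multiplies the salary."""
    log_salary = bias + sum(weights.get(i, 0.0) for i in active_bits)
    return math.exp(log_salary)


# Hypothetical model: a base salary of 20,000 and a factor of 1.25 for "manager".
(manager_bit,) = hashed_bag_of_words("manager")
weights = {manager_bit: math.log(1.25)}
bias = math.log(20000.0)

salary = predict_salary(hashed_bag_of_words("Office manager, London"), weights, bias)
print(round(salary))
```

Storing only the indices of the set bits keeps the example sparse; repeated words set the same bit once, which matches the binary (rather than count-based) features described in Section 2.1.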
We used a squared-error loss function and employed online stochastic gradient descent to minimize the cost function $J(\theta)$ of equation 1.

$$J(\theta) = (y - X\theta)^T (y - X\theta) \qquad (1)$$

2.4 Lasso regression

Since we have a very large number of features, namely the complete vocabulary of words used in the job descriptions, we used lasso regression to select informative features. We used a squared-error loss function with an L1 regularizer and employed online stochastic gradient descent to minimize the cost function $J(\theta)$ of equation 2. The optimal value for the regularization parameter, $\lambda$, was found using cross validation.

$$J(\theta) = (y - X\theta)^T (y - X\theta) + \lambda \lVert\theta\rVert_1 \qquad (2)$$

The learned feature weights of some common professions are shown in a word cloud [6] in Figure 1, where the size of the word is proportional to its weight.

Figure 1: A word cloud showing the weights of common professions

2.5 Neural network regression

To test non-linear regression, we trained an artificial neural network with one hidden layer, a sigmoidal hidden-layer activation function and a linear output function. As with the previous regressors, we used a squared-error loss function, an L1 regularizer and employed online stochastic gradient descent to minimize the cost function $J(\theta, \phi)$. The variable $\theta_j$ of equation 3 is the vector of weights of the $j$th hidden-layer neuron, and $\phi$ is the vector of weights of the output neuron.

$$J(\theta, \phi) = (y - \hat{y})^T (y - \hat{y}) + \lambda \sum_j \lVert\theta_j\rVert_1 + \lambda \lVert\phi\rVert_1 \qquad (3)$$

$$\hat{y}_i = u_i \cdot \phi$$

$$u_{ij} = \tanh(X_i \cdot \theta_j)$$

We alternated between optimizing the number of neurons and the regularization parameter, $\lambda$, using cross validation, until the values converged. We note that this method may have converged on a local minimum, and it is possible that a better global minimum exists.

2.6 Dropout neural network regression

We also trained a dropout neural network [8], which randomly omits each neuron with probability one half at each step of the stochastic gradient descent optimization. Randomly omitting neurons prevents overfitting the model by preventing neurons from co-adapting.
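The random omission of neurons can be sketched as follows. This is a minimal illustration of the forward pass only, assuming tanh hidden units and a linear output as in Section 2.5; it is not the Vowpal Wabbit implementation, and the toy weights and input are hypothetical. The test-time scaling of the outgoing weights by the keep probability follows the "mean network" described in [8].

```python
import math
import random


def hidden_outputs(x, hidden_weights):
    """tanh activation of each hidden neuron for one input vector x."""
    return [math.tanh(sum(w * xi for w, xi in zip(w_j, x))) for w_j in hidden_weights]


def dropout_forward(x, hidden_weights, output_weights, rng, drop_prob=0.5):
    """Training-time prediction: each hidden neuron is omitted with probability drop_prob."""
    total = 0.0
    for u_j, out_w in zip(hidden_outputs(x, hidden_weights), output_weights):
        if rng.random() < drop_prob:
            continue  # this neuron is omitted for the current training step
        total += out_w * u_j
    return total


def mean_forward(x, hidden_weights, output_weights, drop_prob=0.5):
    """Test-time prediction: keep every neuron, scale outgoing weights by the keep probability."""
    keep = 1.0 - drop_prob
    return sum(keep * out_w * u_j
               for u_j, out_w in zip(hidden_outputs(x, hidden_weights), output_weights))


# Hypothetical toy network with two inputs and three hidden neurons.
x = [1.0, 0.5]
hidden_weights = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.8]]
output_weights = [1.0, -2.0, 0.5]

rng = random.Random(0)
samples = [dropout_forward(x, hidden_weights, output_weights, rng) for _ in range(20000)]
average = sum(samples) / len(samples)
print(average, mean_forward(x, hidden_weights, output_weights))
```

Because the output is linear in the hidden activations, the average of many randomly thinned predictions approaches the prediction of the scaled full network, which is why the scaled network can be used in place of the ensemble at test time.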
The optimization of the number of neurons and regularization parameter was conducted as for the canonical neural network.

2.7 Random forest regression

We used random forest regression [9] to predict salaries. We used the scikit-learn [5] implementation, RandomForestRegressor, which does not support sparse feature vectors. For this reason, it was necessary to use a subset of features. We selected words from the lasso model that had the strongest weights. Additional informative words were selected by separating the jobs into five quintiles by salary and selecting words that gave the largest information gain (equation 4) [10], which is to say, those words that performed best at predicting the salary quintile of a job.

$$IG(T, a) = H(T) - H(T \mid a) = H(T) - \sum_{v=0}^{1} \frac{|\{x \in T \mid x_a = v\}|}{|T|} \, H(\{x \in T \mid x_a = v\}) \qquad (4)$$

$$H(X) = E[I(X)] = E[-\ln P(X)] = -\sum_{i=1}^{5} P(x_i) \log P(x_i) = -\sum_{i=1}^{5} \frac{n_i}{N} \log \frac{n_i}{N}$$

The depth of each tree was unlimited, and splitting stopped when only two elements remained in a leaf node. Each tree was trained on all the data, but on a random subset of features. The number of features used in each tree is the square root of the number of available features.

3 Results

3.1 Comparison of regression methods

We optimized the hyperparameters of each model using cross validation, where 70% of the data was used to train the model, and 15% of the data was withheld from training and used to validate the performance of the trained model. A final 15% of the data was used to test the generalization performance of the optimized models. The mean absolute error of each model is shown in Table 1.

Table 1: Mean absolute error (MAE) of various regression methods

Method               Training (£)   Validation (£)   Test (£)
Maximum likelihood   3404           7195             6376
Lasso                4124           6391             5945
Neural network       3912           6284             5868
Dropout NN           3947           6312             5876
Random forest        N/A†           N/A†             5000

† The random forest was run by my co-author, and the training and validation errors were not recorded.

3.2 Lasso regression

The L1 regularization parameter $\lambda$
of the lasso regressor was optimized using cross validation, the result of which is shown in Figure 2. The optimal value was found to be $\lambda = 1.4 \times 10^{-7}$. Of the 199,614 distinct words seen in the job descriptions, 88,087 had non-zero weights.

Figure 2: Optimizing $\lambda$, the L1 regularization parameter of the lasso regressor

3.3 Neural network regression

The optimal number of neurons in the hidden layer of the artificial neural network was found using cross validation, the result of which is shown in Figure 3. The optimal number of neurons was found to be two. That the optimal number of neurons is so small is a surprising result. Also unexpected is that the training error increases with an increasing number of neurons. We would expect the training error to continue to decrease with an increasing number of neurons, as the increasingly complex model overfits the training data.

Figure 3: Optimizing the number of hidden-layer neurons of the neural network ($\lambda = 1.2 \times 10^{-7}$)

The L1 regularization parameter $\lambda$ of the artificial neural network was optimized using cross validation, the result of which is shown in Figure 4. The optimal value was found to be $\lambda = 1.2 \times 10^{-7}$.

Figure 4: Optimizing $\lambda$, the L1 regularization parameter of the neural network with 2 hidden-layer neurons

3.4 Dropout neural network regression

We similarly optimized the number of neurons of a dropout neural network and the L1 regularization parameter $\lambda$ using cross validation and found the optimal number of neurons to be 26, as shown in Figure 5, and the optimal L1 regularization parameter to be $\lambda = 1.4 \times 10^{-9}$.

Figure 5: Optimizing the number of hidden-layer neurons of the dropout neural network ($\lambda = 1.4 \times 10^{-9}$)

Randomly omitting neurons while training effectively trains $2^n$ different models at random, where $n$ is the number of neurons. This ensemble of neural networks acts as a form of regularization, and so it is not surprising to see that the optimal value of the L1 regularization parameter $\lambda$
is roughly 100-fold smaller for the dropout neural network than for the canonical neural network.

3.5 Random forest regression

To optimize the random forest, we tried limiting the depth of the tree and increasing the number of trees and found that fewer trees with unlimited depth performed better than more trees with limited depth. Our final random forest regressor was composed of 50 trees. The number of trees was limited by available computing power, and using more trees should improve performance.

We used 1,000 features and varied the split between the number of features selected by the largest absolute lasso weights and by the largest information gain in distinguishing the salary quintiles. We found that a split of 800 features selected by lasso and 200 features selected by information gain performed best. We also tested 2,000 features and found, somewhat unexpectedly, that 1,000 features performed better. Since each tree is trained on a random subset of features, it becomes important that each feature be informative.

4 Conclusions

The random forest outperformed the neural network, which outperformed the dropout neural network, which outperformed the lasso regression, which outperformed the maximum-likelihood regression. The only surprise here is that the dropout neural network did not perform better than the canonical artificial neural network. A dropout neural network is able to use more neurons than a canonical neural network while still avoiding overfitting, and because it is effectively an ensemble method, it should reduce error due to variance, just as a random forest does.

We are able to predict the salary of a job using a textual description of that job to within a mean absolute error of £5,000. A human tasked with the same problem would not, we expect, perform any better at predicting salaries. We find these results quite satisfactory. That being said, there is almost certainly room for improvement.

The bag-of-words model used is the simplest possible feature space.
Other natural language processing techniques could help, such as removing stop words and using longer sequences of words as features, such as bigrams and trigrams. Another appealing option would be to use simple syntactic analysis to extract noun phrases such as "heavy machinery technician" from the description.

An additional machine learning technique to consider would be k-nearest-neighbours adapted for regression, where the prediction is a weighted average of the salaries of the k most similar job descriptions, and closer neighbours are weighted more heavily than distant neighbours.

We expect that the data would have underlying structure. For example, a number of job postings may be summarized as "A senior programming job in London", and we expect these jobs to have similar salaries. It would be valuable to explore machine learning techniques that attempt to capitalize on this underlying structure, such as a deep-learning neural network, or latent Dirichlet allocation [11], which assumes that the distribution of words for each job depends on a latent category of that job, such as "service industry", and attempts to learn those categories from the data.

References

[1] Kaggle Inc. (2013). Job Salary Prediction. http://www.kaggle.com/c/job-salary-prediction
[2] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML. http://alex.smola.org/papers/2009/Weinbergeretal09.pdf
[3] Austin Appleby (2011). MurmurHash. https://code.google.com/p/smhasher
[4] John Langford (2007). Vowpal Wabbit – a fast online learning algorithm. https://github.com/JohnLangford/vowpal_wabbit
[5] Pedregosa, Fabian, et al. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830. http://scikit-learn.org
[6] Jonathan Feinberg (2013). Wordle – Beautiful Word Clouds. http://www.wordle.net
[7] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society, Series B 58 (1): 267–288. http://www.jstor.org/stable/10.2307/2346178
[8] Hinton, Geoffrey E., et al. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. http://arxiv.org/abs/1207.0580
[9] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. http://link.springer.com/article/10.1023/A:1010933404324
[10] Wikipedia authors (2013). Information gain in decision trees. http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
[11] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022. http://dl.acm.org/citation.cfm?id=944937

