NLP Site Information

Introduction

The data provided in this online resource consists of word embedding models derived from a large corpus of textual data gathered from the institutional web domains of 50 elite U.S. universities. An automated web crawler (spider) was used to scrape the textual data found on each university's website by following links within the university's official online domain and collecting all detected textual content. Word embeddings were then calculated using the word2vec algorithm (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013).

Word embeddings (also known as word vectors) build on distributional semantics theory (DST). DST is a field of linguistics that studies theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large corpora of textual data. The basic idea of distributional semantics can be summed up in the distributional hypothesis, which postulates that linguistic items with similar distributions tend to have similar meanings.

The distributional hypothesis was first suggested by Firth (1957), who proposed that words occurring in similar contexts tend to have similar meanings, captured in the famous sentence “you shall know a word by the company it keeps”. Therefore, the meaning of a word can be approximated by the set of contexts in which it occurs.

Recent advances in machine learning for natural language processing (NLP) have given credence to the distributional hypothesis. In particular, new techniques for creating word embeddings that leverage the context in which words appear in large corpora of textual data have significantly contributed to improving the state of the art in machine translation, sentiment analysis, part-of-speech tagging, document summarization, text classification and information retrieval. Word embeddings is the collective name for a set of language modelling and feature learning techniques in NLP in which words or phrases from a vocabulary are mapped to dense vectors of real numbers. Each vector comes to represent, in an abstract way, the “meaning” of a word, with different dimensions encoding different syntactic and semantic connotations of the word.

The following Figure shows an illustrative mapping of words to embedding vectors. In such a vector space, different dimensions encode semantic and syntactic meaning based on the usage of the words in the corpus on which the word embedding model was trained. In the Figure, for explanatory purposes, some illustrative dimensions have been given idealized names to convey the kind of information captured in the dimensions of the vector space. Such vector spaces possess the distinctive property that words that are semantically or syntactically similar are mapped to adjacent regions of the vector space (top right of the Figure). Since it is not possible to visualize vector spaces of more than 3 dimensions, the geometrical properties of such multidimensional spaces can only be visualized by applying dimensionality reduction techniques to the original vector space. In this way, we can see in 2 dimensions how the word embedding model manages to bring related words to nearby regions.

Illustration of words mapped to 7-dimensional numerical vectors, also known as word embeddings. The dimensions of the word embeddings codify semantic and syntactic features of the words. In this figure, for clarity, the dimensions are given illustrative names for interpretability. For visualization purposes, high dimensional spaces can be mapped to low dimensional spaces using techniques that preserve the geometrical structure of the original space. Word embeddings possess the distinctive property that words with close semantic meanings are mapped to adjacent regions of the vector space (top right). The vector space also captures meaningful syntactic and semantic regularities, such as certain directions codifying semantic relationships between words, for instance gender, as shown by the dotted lines in the bottom right of the figure.
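
As an illustrative sketch, the 2-dimensional visualization described above can be produced by projecting the trained word vectors with a dimensionality reduction technique such as t-SNE (Maaten & Hinton, 2008). The model file name and word list below are hypothetical placeholders, not part of the released material:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    from gensim.models import KeyedVectors

    # Hypothetical file name for the word vectors provided on this site
    wv = KeyedVectors.load("university_corpus_word2vec.kv")

    # A handful of illustrative words (assumed to be in the model vocabulary)
    words = ["professor", "lecturer", "student", "exam", "campus",
             "library", "man", "woman", "king", "queen"]
    vectors = np.array([wv[w] for w in words])

    # Project the 300-dimensional embeddings down to 2 dimensions
    coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), word in zip(coords, words):
        plt.annotate(word, (x, y))
    plt.show()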

Another property of word embeddings is that the learned word representations capture meaningful syntactic and semantic regularities between words, such as gender or verb tense, and that these regularities are consistent across the vector space. The regularities are observed as constant vector offsets between pairs of words sharing a particular relationship. The dotted lines in the bottom right of the previous Figure illustrate a consistent gender axis that exists in the exemplary vector space. This property permits the use of analogical reasoning to answer questions such as “man is to woman as king is to…” by using vector algebra of the form v(king) − v(man) + v(woman), where v(n) stands for the vector representation of word n. In a properly trained word embedding model, built from a sufficient and relevant text corpus, the result of this vector algebra operation will be a vector whose closest neighbor is the vector for the word queen.
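
As a minimal usage sketch, the analogy query above can be issued against the trained vectors with gensim's most_similar method; the model file name below is a hypothetical placeholder:

    from gensim.models import KeyedVectors

    # Hypothetical file name for the word vectors provided on this site
    wv = KeyedVectors.load("university_corpus_word2vec.kv")

    # v(king) - v(man) + v(woman): the nearest neighbor should be "queen"
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))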

A particular type of word embedding that has become very popular in the machine learning literature is the word2vec set of techniques (Mikolov et al., 2013). Word2vec uses a shallow neural network to learn a distributed representation of words based on the textual contexts in which they occur within a text corpus, thus leveraging the distributional hypothesis (Firth, 1957).

After training word2vec on a text corpus, words that are used in similar contexts will end up with similar numerical vector representations. One of the most impressive capabilities of word2vec is its ability to draw together words that are used synonymously in similar contexts even if they never appear together in the training corpus. This feature is a key component of the ability of word2vec to generalize.

One of the main advantages of working in a vector space is that we can use cosine similarity to quantify the proximity of word embeddings in that vector space. Since word2vec brings words that are used in similar contexts, and are thus semantically related according to the distributional hypothesis, to adjacent regions of the vector space, the context in which a word is used in a corpus of text can serve as a reliable proxy to estimate the semantic denotation with which the word is used in the corpus.
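
For illustration, cosine similarities and nearest neighbors can be queried directly from the gensim key-vectors object (the file name and query words below are hypothetical examples):

    from gensim.models import KeyedVectors

    # Hypothetical file name for the word vectors provided on this site
    wv = KeyedVectors.load("university_corpus_word2vec.kv")

    # Cosine similarity between the embeddings of two words
    print(wv.similarity("professor", "lecturer"))

    # Words whose embeddings lie closest to a query word in the vector space
    print(wv.most_similar("semester", topn=5))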

University domains scraping

Between February and April of 2018, the Internet domains of 50 elite universities in the U.S. were scraped using Scrapy, an application framework for automated crawling and extraction of data from websites. The list of universities scraped was taken from the top 50 entries in the US News University Ranking Charts of 2017. The scraping process collected only the textual content of HTML elements such as <p>, <li>, <td>, <div>, and <a>. HTML elements not containing natural language, such as structural, script or styling elements, were specifically left out. Coding logic was used to prevent redundant scraping of already fetched nested elements.

The scraping process started at the base URL of each university domain and proceeded to extract all target textual elements and to follow the detected links pointing within the university domain, up to a predefined depth level, to continue collecting textual elements. A depth-first crawling algorithm was used for visiting scraped links, for memory-efficiency reasons.
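
The sketch below illustrates the kind of Scrapy spider used for this process; the domain, depth limit and output fields are illustrative placeholders rather than the exact production configuration:

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class UniversitySpider(scrapy.Spider):
        name = "university"
        # Hypothetical example domain; one spider run per university domain
        allowed_domains = ["example.edu"]
        start_urls = ["https://www.example.edu/"]
        custom_settings = {
            "ROBOTSTXT_OBEY": True,  # robots.txt directives were always respected
            "DEPTH_LIMIT": 4,        # illustrative value; actual depths varied per domain
        }

        def parse(self, response):
            # Collect only the text of the target HTML elements
            texts = response.css("p::text, li::text, td::text, div::text, a::text").getall()
            yield {"url": response.url,
                   "text": " ".join(t.strip() for t in texts if t.strip())}
            # Follow in-domain links; Scrapy visits pending requests depth-first by default
            for link in LinkExtractor(allow_domains=self.allowed_domains).extract_links(response):
                yield response.follow(link, callback=self.parse)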

Different universities organize the information in their websites in tree structures of varied characteristics. Therefore, scraping the 50 university domains to a fixed depth limit generates output files of very different sizes. Consequently, scraping sessions were carried out at different depth levels for each university domain to ensure the volume of text scraped from each university fell within the same order of magnitude. The file sizes of scraped textual data from each university web domain ranged from 170 MB to 611 MB (mean = 324 MB, SD = 96 MB). In total, a text corpus of 16 GB was retrieved. All the scraping rounds for each domain were carried out to completion to ensure sample representativeness. The robots.txt directives for each university domain were always respected.

Generating word embeddings using word2vec

For a computational analysis of the meaning of a word based on the context in which it occurs, an encoding scheme that permits the quantification of similarity between words is needed. In the machine learning community, the state-of-the-art approach to such an encoding is to model words using word embeddings.

Word embedding models represent (embed) words in a continuous and dense vector space where semantically similar words are mapped to nearby regions. These models make use of the distributional hypothesis, which states that words that appear in similar contexts share semantic meaning. There are two main computational approaches to generate word embeddings: count-based methods and predictive methods. Count-based methods compute the statistics of how often word pairs co-occur in large text corpora and then map the count statistics to a small dense vector for each word. Predictive models are trained to predict a word from its neighbors, using small dense embedding vectors that serve as the learned parameters of the model. Of the different methods to generate word embeddings, word2vec is a particularly computationally efficient predictive model with robust performance.

Word2vec provides two similar shallow neural network architectures to compute word vectors: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model. The CBOW model is trained to predict target words from source context words. The Skip-Gram model does the inverse and is trained to predict source context words from the target word. Both the CBOW and the Skip-Gram model work by using a neural network to nudge together, in the embedding space, the vector representations of words that appear in similar contexts, and to nudge apart the vector representations of words that do not often appear in similar contexts.

To generate the word embeddings of the corpus vocabulary, the gensim (Řehůřek & Sojka, 2010) implementation of word2vec was used. Both the CBOW architecture and the Skip-Gram architecture generated similar results, but CBOW performed slightly better on the gathered corpus, so it is the model provided on this website. For training the word vectors, the following parameters were used for the gensim word2vec class: vector dimensions = 300, window size = 15, minimum word count to be included in the model = 5, negative sampling = 10, down-sampling of frequent words = 0.001, number of iterations (epochs) through the corpus = 5.
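
For reference, the sketch below shows how a CBOW model with these parameters would be trained using gensim (argument names follow gensim 4.x; older releases use size and iter instead of vector_size and epochs; the corpus file name is a hypothetical placeholder):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Hypothetical corpus file: one preprocessed sentence per line
    sentences = LineSentence("university_corpus.txt")

    model = Word2Vec(
        sentences,
        sg=0,             # 0 = CBOW architecture, 1 = Skip-Gram
        vector_size=300,  # dimensionality of the word vectors
        window=15,        # context window size
        min_count=5,      # minimum word count to be included in the model
        negative=10,      # negative sampling
        sample=0.001,     # down-sampling of frequent words
        epochs=5,         # iterations (epochs) through the corpus
    )
    model.save("university_corpus_word2vec.model")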

Quantifying similarity between word embeddings

In order to quantify the similarity between word embeddings, the cosine similarity metric was used. The cosine similarity between two nonzero vectors v and w is the cosine of the angle θ between them, cos(θ) = (v · w) / (‖v‖ ‖w‖), and quantifies their similarity in the vector space they inhabit. Two vectors with similar orientations have a cosine similarity close to 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, as illustrated in the following Figure.

Similarity between vectors can be estimated by calculating the cosine of the angle θ between them. Normalized vectors pointing in similar directions and therefore adjacent in vector space will have cosine similarities close to 1. Dissimilar vectors pointing in very different directions, with a large angle θ between them, will have negative values of cosine similarity.
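
A minimal sketch of this computation using NumPy:

    import numpy as np

    def cosine_similarity(v, w):
        """Cosine of the angle between two nonzero vectors v and w."""
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    # Similar orientation -> close to 1, orthogonal -> 0, opposite -> -1
    print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.5])))   # close to 1
    print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))   # 0.0
    print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0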

Evaluating the quality of the word embeddings

Estimating the quality of the word vectors created by a word embedding algorithm is essential to evaluate the potential usefulness of the model. Yet the word2vec training algorithm is unsupervised, so there is no definitive way to evaluate the quality of the word embeddings generated. There are, however, proxies that provide fair approximations.

The creators of word2vec released a test set of about 20,000 syntactic and semantic examples for analogical reasoning of the type “A is to B as C is to D”, such as “man is to woman as king is to …”. If the word embedding representation is able to produce, from the vector algebra operation, a vector whose closest neighbor is the vector for queen, that word analogy instance is considered correctly classified by the model. In this task, our word embedding model trained on the entire universities data corpus achieved a classification performance of 60% (58% for semantic analogies and 61% for syntactic analogies). This is not too far from what the original authors of the method achieved on a much larger corpus of news articles: 77% (77% semantic accuracy and 76% syntactic accuracy). It is important to note that one of the components of the semantic analogical reasoning test in the original paper was currency analogies of the type “US is to dollar as Europe is to euro”. The information required to answer such analogies correctly is common in news article sources, particularly in the finance section, yet it rarely occurs on university websites, so our model trained on the university corpus achieves a 0% classification accuracy on the currency analogy task, which lowers the overall average semantic accuracy of the analogical reasoning evaluation metric.
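
As a sketch, this analogy evaluation can be reproduced with gensim's evaluate_word_analogies helper, which ships with the word2vec question set (the model file name below is a hypothetical placeholder):

    from gensim.models import KeyedVectors
    from gensim.test.utils import datapath

    # Hypothetical file name for the word vectors provided on this site
    wv = KeyedVectors.load("university_corpus_word2vec.kv")

    # The ~20,000-question analogy test set released with word2vec is bundled with gensim
    score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
    print(f"overall analogy accuracy: {score:.2%}")
    for section in sections:
        correct, incorrect = len(section["correct"]), len(section["incorrect"])
        if correct + incorrect:
            print(section["section"], f"{correct / (correct + incorrect):.2%}")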

An additional method to evaluate the quality of the word embeddings is to test the ability of the model to generate estimates of similarity between word pairs that are close to human judgment. This is done by comparing the similarity scores generated by the model against sets of word pairs with a pre-established similarity score determined by a human panel. A test set of 353 word pairs (WordSim-353) is often used with the gensim library to evaluate the quality of word embeddings generated by word2vec. Our model achieves a 60% Pearson correlation coefficient (62% Spearman) with human judgment on this task. This is on par with what the original word2vec model achieves when trained on a larger corpus of news articles (62% Pearson and 66% Spearman).
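
Similarly, the word-pair similarity comparison can be run with gensim's evaluate_word_pairs helper and the WordSim-353 test set bundled with the library (the model file name is again a hypothetical placeholder):

    from gensim.models import KeyedVectors
    from gensim.test.utils import datapath

    # Hypothetical file name for the word vectors provided on this site
    wv = KeyedVectors.load("university_corpus_word2vec.kv")

    # WordSim-353: 353 word pairs with human-assigned similarity scores
    pearson, spearman, oov_ratio = wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
    print("Pearson correlation:", pearson[0])
    print("Spearman correlation:", spearman[0])
    print("out-of-vocabulary ratio (%):", oov_ratio)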

References

Firth, J. R. (1957). A synopsis of linguistic theory, 1930-1955.

Maaten, L. van der, & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (pp. 3111–3119). Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50). Valletta, Malta: ELRA.