By Nitya Kasturi and Aniket Dalal
Note: we reference several internal documents and scripts within this report. Contact us if you are interested in the files mentioned.
This paper explores whether neural networks, a supervised learning technique, can extract information that an unsupervised method would normally provide. We primarily used NeuroNER, a named entity recognition framework built on TensorFlow, to train and test on our data.
We test whether, given documents of service data, a neural network can learn to extract the names of possible components, that is, the parts of a product that can be replaced or serviced.
We treat this as a Named Entity Recognition (NER) problem by labeling known components in our training data.
This problem has previously been solved using Conditional Random Fields (CRFs), and we wanted to explore whether a neural network could discover components that the CRF method did not find.
Predii receives an enormous amount of textual technical documents. These documents contain mentions of a wide variety of components and systems, and the challenge for us is to discover an exhaustive list of them.
We chose to view this problem as a Named Entity Recognition (NER) problem. NER involves first identifying proper names in texts and then classifying them into a set of predefined categories of interest. In our case, the only predefined category is “Component.”
Because this task has already been carried out with conditional random fields, we use those results as a baseline for a comparative analysis, investigating whether deep learning with recurrent neural networks can produce similar or better results.
A. Conditional Random Fields
Conditional random fields are a class of statistical modeling methods that are often applied for prediction use cases. Essentially, a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets X and Y, the observed and output variables, respectively. The conditional distribution p(Y | X) is then modeled.
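For sequence labeling tasks such as NER, the most common variant is the linear-chain CRF. As a sketch, assuming that variant (the text above does not state which form was used) and using feature functions f_k and weights \lambda_k that are notation introduced here for illustration, the conditional distribution for a token sequence x = (x_1, ..., x_n) and a label sequence y = (y_1, ..., y_n) can be written as:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Here Z(x) sums over every possible label sequence y', so only the labels are modeled probabilistically while the observed tokens are conditioned on.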
B. Recurrent Neural Networks
Although neural networks are a supervised method of training, we wanted to see whether they could produce results similar to those of the CRF method, provided we have some annotated training data.
A recurrent neural network (RNN) is a class of artificial neural networks where connections between units form a directed cycle. This allows the network to exhibit dynamic temporal behavior. RNNs can use their internal memory to process arbitrary sequences of inputs along with the current input. These models are often used for natural language processing due to this unique quality.
More specifically, we use long short-term memory (LSTM) units in our recurrent neural network. An LSTM block is composed of four main components: a cell, an input gate, an output gate, and a forget gate. The cell is responsible for “remembering” values over arbitrary time intervals. Each of the three gates can be thought of as a “conventional” artificial neuron, as each computes an activation of a weighted sum.
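To make the roles of the cell and the three gates concrete, the following is a minimal sketch of one LSTM time step for a single unit with scalar input and state. It follows the standard LSTM formulation rather than NeuroNER's actual implementation, and the parameter names (w_i, u_i, b_i, and so on) are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic activation used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM (scalar input, hidden state, and cell).

    `params` is a dict of hypothetical weights: w_* multiplies the input,
    u_* multiplies the previous hidden state, and b_* is a bias.
    """
    # Each gate is a sigmoid applied to a weighted sum, like a conventional neuron.
    i = sigmoid(params["w_i"] * x + params["u_i"] * h_prev + params["b_i"])  # input gate
    f = sigmoid(params["w_f"] * x + params["u_f"] * h_prev + params["b_f"])  # forget gate
    o = sigmoid(params["w_o"] * x + params["u_o"] * h_prev + params["b_o"])  # output gate
    # Candidate value that may be written into the cell.
    g = math.tanh(params["w_g"] * x + params["u_g"] * h_prev + params["b_g"])
    # The cell keeps part of its old value and adds part of the candidate.
    c = f * c_prev + i * g
    # The hidden state is the gated, squashed cell value.
    h = o * math.tanh(c)
    return h, c
```

With all weights set to zero, every gate evaluates to 0.5, so the cell simply halves its previous value at each step; training adjusts the weights so that the gates decide what to keep, write, and expose.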
C. Data Format
The data is provided in a CSV file with each row corresponding to one work order. We extract only the service and repair notes to use as our training and testing data. Note that the text is not always written in complete sentences and contains many unnecessary characters that need to be removed. Further, most of the data must be converted into a format that is readable by TensorFlow and NeuroNER, which will be discussed later on.
D. TensorFlow and NeuroNER
The primary software used for training is TensorFlow, Google’s open-source software library for machine learning and data analytics. More specifically, we used NeuroNER, a publicly available named entity recognition program that builds its neural network on TensorFlow. NeuroNER takes text documents together with annotation files that list, for each entity, the word, the entity type, and the character range in the corresponding text document. The output is similar: it returns each word and its classification. If multiple words are classified as a single entity, the output marks the later words as a continuation of the preceding entity.
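As an illustration of such an annotation file, the sketch below writes entity lines in the BRAT standoff layout (an ID, the entity type with its character range, and the covered text, separated by tabs). We assume this layout here because NeuroNER accepts BRAT-style input; the function name and the example sentence are hypothetical.

```python
def brat_lines(text, spans, label="Component"):
    """Build BRAT-style annotation lines for (start, end) character spans.

    Each line has the form: T<id> TAB <label> <start> <end> TAB <covered text>.
    The end offset is exclusive, as in Python slicing.
    """
    lines = []
    for idx, (start, end) in enumerate(spans, start=1):
        lines.append(f"T{idx}\t{label} {start} {end}\t{text[start:end]}")
    return lines


# Example: one two-word component in a short repair note.
note = "replaced battery pack"
for line in brat_lines(note, [(9, 21)]):
    print(line)
```

Each sentence's text file and its annotation file share the same base name, so the character ranges always refer to the matching text.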
NeuroNER consists of several layers: a character-enhanced token-embedding layer, a label prediction layer, and a label sequence optimization layer. The character-enhanced token-embedding layer maps each token to a vector representation. The sequence of vector representations corresponding to a sequence of tokens is then fed into the label prediction layer, which outputs a sequence of vectors, each containing the probability of every label for the corresponding token. Lastly, the label sequence optimization layer outputs the most likely sequence of predicted labels based on the sequence of probability vectors from the previous layer. All layers are learned jointly.
NeuroNER allows hyperparameters such as the learning rate and dropout rate to be modified. We deliberately lowered the original learning rate because the model learned very quickly in the first epochs, which suggested that the rate was too high. The original NeuroNER implementation also includes a CRF layer; we removed it so that the results reflect the neural network alone.
The following section goes over how our data was formatted into input for NeuroNER’s network and how the output was extracted for analysis. For a more in-depth look into the specifics of the code, please take a look at the README, code, and code documentation provided.
A. Summary of Pipeline
The CSV file is first fed through a script named extract.py, which annotates the text and produces text files and annotation files in the format that NeuroNER can read. The data is then passed into NeuroNER. To examine our results, we use several scripts that parse the output into component frequencies and compare them with the results from the CRF model. get-component-names.py extracts the output from NeuroNER into a component list with the frequency of each component. sort.py sorts this list in descending order of frequency, and compare.py generates a CSV that compares any two given frequency lists.
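The comparison step can be sketched as follows. This is not the compare.py script itself but a minimal stand-in, assuming each frequency list is a mapping from component name to count; the function name, column names, and sort order are our own choices for illustration.

```python
import csv
import io


def compare_frequencies(freq_a, freq_b, name_a="neuroner", name_b="crf"):
    """Return CSV text with one row per component and its count in each list.

    A component missing from one list gets a count of 0 in that column.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["component", name_a, name_b])
    components = set(freq_a) | set(freq_b)
    # Most frequent in the first list first; ties are broken alphabetically.
    for component in sorted(components, key=lambda c: (-freq_a.get(c, 0), c)):
        writer.writerow([component, freq_a.get(component, 0), freq_b.get(component, 0)])
    return buffer.getvalue()


# Example with made-up counts.
print(compare_frequencies({"tray": 3, "package": 5}, {"package": 4, "sealer belt": 2}))
```

Rows with a zero in one column are the interesting ones, since they mark components that only one of the two methods found.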
B. Preprocessing Data
At the beginning of our training process, we realized that the data must be preprocessed if useful component names are to be extracted. Without any preprocessing, the model could not extract useful component names after 15 epochs, even though it showed accuracies of 90% and higher. The most frequent component that came out of the NeuroNER training was ‘package,’ with 1833 occurrences.
We preprocessed the text in several steps. We first split the data into separate text files, one per sentence. We then removed periods that are neither at the end of a sentence nor part of a decimal number; forward slashes that are not part of a fraction; parentheses, colons, and semicolons; any other non-alphanumeric characters; and extra spaces.
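A minimal sketch of these cleaning rules for a single sentence is shown below. It is an illustration using regular expressions, not the code in extract.py; in particular, replacing removed characters with a space (so that neighboring words do not merge) is our assumption.

```python
import re


def clean_sentence(sentence):
    """Apply the cleaning rules described above to one sentence."""
    s = sentence.strip()
    # Drop everything except letters, digits, whitespace, periods, and slashes
    # (this covers parentheses, colons, semicolons, and other symbols).
    s = re.sub(r"[^A-Za-z0-9./\s]", " ", s)
    # Drop periods unless they sit between two digits or end the sentence.
    s = re.sub(r"(?<!\d)\.(?!\Z)|\.(?!\d|\Z)", " ", s)
    # Drop slashes unless they sit between two digits (a fraction such as 3/4).
    s = re.sub(r"(?<!\d)/|/(?!\d)", " ", s)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", s).strip()


print(clean_sentence("Replaced  3/4 in. bolt (rusted); torque 2.5 nm w/ new seal."))
```

The fraction, the decimal number, and the final period survive, while the abbreviation period, the parentheses, the semicolon, and the stray slash are removed.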
Words were annotated in four categories: components, stop words, verbs that relate to a component, and other. To make training simpler, when two known components overlap in the text, we keep only the longer one. For example, if ‘battery power’ appears in the text and both ‘battery’ and ‘battery power’ are known components, the phrase is annotated as ‘battery power’ for training purposes.
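One way to implement this longest-match preference is sketched below, assuming the known components are available as a plain list of strings. The function name is hypothetical, and matching on word boundaries without regard to case is our assumption.

```python
import re


def find_component_spans(text, components):
    """Return (start, end, matched_text) for known components, longest first.

    Sorting the alternatives by length makes the regex try 'battery power'
    before 'battery', so the longer component wins when both match.
    """
    if not components:
        return []
    ordered = sorted(components, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(c) for c in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


print(find_component_spans("low battery power after battery swap",
                           ["battery", "battery power"]))
```

In the example, the first mention is annotated as the two-word component and the second as the single word, and the resulting character ranges can be written directly into an annotation file.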
C. Code Structure
The code was structured such that everything before training is in one file and everything after training is in another file. This is what was executed before training:
- Preprocess the text
- Separate the text by sentence
- Annotate each sentence with the provided annotations of which words are components
- Create annotation files for each sentence
- Separate sentences into training, validation, and testing categories (50%, 25%, and 25% respectively)
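The last step in the list above can be sketched as follows. Shuffling with a fixed seed before splitting is our assumption (the report does not say how sentences were assigned), and the function name is hypothetical.

```python
import random


def split_sentences(sentences, seed=0):
    """Split sentences into training, validation, and testing sets (50/25/25)."""
    items = list(sentences)
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_train = len(items) // 2
    n_valid = len(items) // 4
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # the remainder, about 25%
    return train, valid, test


train, valid, test = split_sentences([f"sentence {i}" for i in range(8)])
print(len(train), len(valid), len(test))
```

Any sentences left over by the integer division end up in the testing set, so no sentence is dropped.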
In the next section, we will go into detail about how the results were processed for comparison. For more details about the code base, please read the README, code, and code documentation.
D. Processing Results
The output of NeuroNER’s training splits all text by word and lists the classification of each word right next to the word. If there is a multi-word classification, such as “battery pack,” there is a different label to indicate that the classification is a continuation of the last word.
From this output, we extract all annotations and create a frequency mapping for each component. From there, we compare the frequencies of components annotated by CRF with the components annotated by the trained neural network.
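The extraction and counting step can be sketched as below. We assume the output uses BIO-style labels, with B-Component opening an entity and I-Component marking a continuation; the exact label names in the real output may differ, and this is an illustration rather than the get-component-names.py script.

```python
from collections import Counter


def component_frequencies(tagged_tokens, entity="Component"):
    """Count components in a sequence of (word, label) pairs.

    Assumes BIO labels: 'B-<entity>' starts a component, 'I-<entity>'
    continues the previous one, and any other label ends it.
    """
    counts = Counter()
    current = []  # words of the component being assembled

    def flush():
        if current:
            counts[" ".join(current)] += 1
            current.clear()

    for word, label in tagged_tokens:
        if label == f"B-{entity}":
            flush()
            current.append(word.lower())
        elif label == f"I-{entity}" and current:
            current.append(word.lower())
        else:
            flush()
    flush()
    return counts


tokens = [("replaced", "O"), ("battery", "B-Component"), ("pack", "I-Component"),
          ("and", "O"), ("tray", "B-Component"),
          ("battery", "B-Component"), ("pack", "I-Component")]
print(component_frequencies(tokens).most_common())
```

The resulting counter maps each component to its frequency, which is the form needed for the comparison against the CRF results.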
FIG. 1. Architecture of the artificial neural network (ANN) model. n is the number of tokens, and x_i is the i-th token. l(i) is the number of characters and x_{i,j} is the j-th character in the i-th token. y_i is the predicted label of the i-th token.
In this section, we discuss the results of our training and how they compare to the Conditional Random Field results.
A. Numerical Analysis of Training
In our final iteration of training, with four classification categories, we achieved an overall accuracy of 84.31% on the testing set and correctly identified 125007 phrases. Figure 2 displays how the accuracy grew over the training process for the training, testing, and validation sets.
The most commonly identified components are “packages,” with a frequency of 2066, and “trays,” with a frequency of 1221. The first two-word component identified is “sealer belt,” with a frequency of 279. To check the accuracy of the results in terms of frequency, we also note the frequency of each of these terms in the data. Compared to the actual frequency of the words found, NeuroNER with the annotated training data performed fairly well, with an accuracy of around 51%, considering that some extraneous words had a high frequency in the data.
Of the 240 useful components that NeuroNER found in the preprocessed data, around 51 were new; the rest were a subset of the training data. Please refer to the attached documents for a more in-depth analysis of the model's accuracy. Results.xlsx contains a histogram of the accuracy distribution of the components, as well as the frequency found by NeuroNER and the actual frequency of each component found. NewComponents.txt contains all the new components discovered in this training process.
FIG. 2. Accuracy growth over 25 epochs.
Even though the results seem comparable to the CRF results, the component names are much shorter and have lower frequencies in the data. Neural networks are better suited to reinforcing what is already known than to discovering new information through training. However, the NeuroNER model did find reasonable components that were not in the training set. Unsupervised methods would work best for this type of scenario, but NeuroNER performed better than expected.