My Predii Internship Experience: Creating a Multipurpose Sentiment Analyzer
My name is Pranav Devabhaktuni and I am a junior in high school in Atlanta. This summer, I spent six weeks interning at Predii!
We would have daily meetings where we would review my code, talk about how my program could be improved, and talk about additional configurations that could be added to my program.
Tweets & Reviews As Sources of Actionable Information
My project was to build a multi-purpose analysis tool to collect and extract insights from publicly available data, such as tweets regarding COVID and vehicle customer satisfaction reviews. The program, written in Python, collects data, performs sentiment analysis, calculates basic statistics, and creates visualizations.
The project required multiple use cases in order to prove its multipurpose nature. The first proving ground was to see if the sentiment and the volume of tweets from a location could be associated with the number of new COVID cases in the area.
The second purpose of the program was to analyze reviews from people who talked about their vehicles, to assess the sentiment within this data, and to determine the most common positive and negative feedback.
COVID Tweets Key Takeaways:
Everyone is tweeting about COVID – there’s little correlation state-by-state between influx of new cases and tweet frequency.
COVID appears to break common sentiment analysis algorithms: in the context of a pandemic, “new” and “positive” are not positive words!
Vehicle Reviews Key Takeaways:
MPG, driveability, and comfort are the top three factors driving both positive and negative experiences customers have with their vehicles. For example, great mileage is the top positive feedback, and its inverse, awful mileage, is the top negative feedback. This mirroring is fairly consistent, with one exception: the dealership experience. A dealership experience can generate extremely negative reviews, but no positive ones. You can have a neutral dealership experience or a negative one, but the only way to win is by not losing.
The program was broken down into three major steps:
Collecting, Storing, and Extracting Data
Using Sentiment Analysis to Determine Context
Analyzing and Visualizing the Data
The overall architecture and data flow is shown below:
1st: Collecting, Storing, and Extracting Data
We used public data sets for this project. We began with research into tweepy, a Python library that allows easy use of the Twitter API. Tweepy allows the streaming of live Twitter feeds into programs for analysis, and it also allows the filtering of tweets based on users or certain keywords.
The first keywords used were “COVID,” “COVID-19,” and “Coronavirus.”
Through the “TweetPuller,” the program was able to collect various types of data, including the coordinates of the tweet and the contents of the tweet.
Here is a code example of the request and response:
class StreamListener(tweepy.StreamListener):
    def on_data(self, raw_data):
        try:
            data = json.loads(raw_data)
            print(data['text'])
            # Skip retweets and keep only geotagged tweets
            if data['text'].find("RT", 0, 4) == -1 and (data['coordinates'] or data['geo']):
                # GeoJSON stores coordinates as [longitude, latitude]
                lon = str(data['coordinates']['coordinates'][0])
                lat = str(data['coordinates']['coordinates'][1])
                csvFile.write(data['text'] + "|" + lat + "|" + lon + "\n")
                csvFile.flush()
        except Exception as e:
            print(e)

streamListener = StreamListener()
myStream = tweepy.Stream(auth=api.auth, listener=streamListener, tweet_mode="extended")
myStream.filter(track=["COVID-19", "COVID", "Coronavirus"], languages=["en"])
---------------------------------------------------------------------------------------
'Hand loomed 3-layered Face masks for Incoming First Generation College students at UCSB. Keeping students Covid-Safe https://t.co/2hRC3KEP8j'|34.439747|-119.760712
The next step was to append the tweets to a .csv file which would then be read by the “TweetAnalyzer” file in the program.
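A minimal sketch of how the "TweetAnalyzer" side might read that pipe-delimited file back in. The file name, column order (text|lat|lon), and function name here are illustrative, inferred from the sample output above, not the project's actual code:

```python
def read_tweets(path):
    """Parse the pipe-delimited file written by the TweetPuller.

    Columns are assumed to be text|lat|lon, per the sample output.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("|")
            if len(parts) < 3:
                continue  # skip malformed lines
            # lat/lon are the last two fields, so a '|' inside the tweet
            # text is re-joined -- a limitation of this simple delimiter.
            text = "|".join(parts[:-2])
            lat, lon = float(parts[-2]), float(parts[-1])
            rows.append({"text": text, "lat": lat, "lon": lon})
    return rows
```

A stricter format (e.g. the `csv` module with proper quoting) would avoid the delimiter-collision issue entirely.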
For the second use case, analyzing vehicle data, the “TweetPuller” pulled data related to 5 specific car makes and their models. Unfortunately, the majority of the data were advertisements for the car make and model. We wanted true customer feedback. We found OpinRankDataset, which contained approximately 42,000 vehicle reviews. These reviews were split by year, make, and model.
2nd: Using Sentiment Analysis to Determine Context
In order to process the raw data pulled from Twitter, we needed a plug-in with pre-trained models that determine sentiment from raw text. We chose VADER because it is especially attuned to social media analysis, including emoji interpretation.
For use case two, analyzing vehicle data, we switched to TextBlob because it was trained on product reviews rather than social media posts. In this way, we demonstrated the program's multipurpose flexibility, swapping in whichever plug-in was best suited to the use case at hand.
In both cases, with both VADER and TextBlob, raw text was the input, and the output was a score between -1 and +1, with -1 being the most negative possible assessment, +1 being the most positive, and values in between indicating varying intensity.
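The real calls are one-liners: VADER's `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` and `TextBlob(text).sentiment.polarity`. To illustrate the shared -1 to +1 contract without requiring either package, here is a toy lexicon-based stand-in; the word lists and scoring rule are purely illustrative, nothing like VADER's actual lexicon:

```python
# Illustrative word lists -- NOT VADER's or TextBlob's actual lexicons.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"awful", "bad", "hate", "terrible"}

def toy_polarity(text):
    """Score text on the same -1..+1 scale VADER and TextBlob use."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words: neutral
    return (pos - neg) / (pos + neg)
```

Real analyzers also handle negation, intensifiers, and punctuation, which is exactly why we reached for pre-trained plug-ins instead of rolling our own.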
3rd: Analyzing and Visualizing the Data
The final step was to begin creating intelligence from the data. We’ll begin with the COVID results.
To begin, we formulated basic statistics. Using these -1 and +1 sentiment scores, we calculated mean, variance, standard deviation and median for the tweet data. The data was also then further segmented into two buckets, negative and positive, and the same statistics were applied to those two buckets.
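The statistics step can be sketched with Python's standard library alone; the function names and the bucket threshold at zero are assumptions for illustration:

```python
import statistics

def summarize(scores):
    """Mean, variance, standard deviation, and median of sentiment scores."""
    return {
        "mean": statistics.mean(scores),
        "variance": statistics.pvariance(scores),
        "std_dev": statistics.pstdev(scores),
        "median": statistics.median(scores),
    }

def bucket_stats(scores):
    """Split scores into positive and negative buckets, then summarize each.

    Thresholding at 0 is an assumption; exactly-zero scores are dropped.
    """
    positive = [s for s in scores if s > 0]
    negative = [s for s in scores if s < 0]
    return {"positive": summarize(positive), "negative": summarize(negative)}
```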
Then, we began to create graphical representations of the data. Using matplotlib, a Python library that allows for easy plotting of data, a histogram was created regarding the number of characters and the number of words in the tweets. We also created a bar graph that would show the most common words based on the sentiment of the tweet.
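A minimal matplotlib sketch of the length histograms; the function name, figure layout, and output file are assumptions, not the project's actual plotting code:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def plot_length_histograms(tweets, out_path="tweet_lengths.png"):
    """Histograms of characters-per-tweet and words-per-tweet."""
    char_counts = [len(t) for t in tweets]
    word_counts = [len(t.split()) for t in tweets]
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(char_counts, bins=20)
    ax1.set(title="Characters per tweet", xlabel="characters", ylabel="tweets")
    ax2.hist(word_counts, bins=20)
    ax2.set(title="Words per tweet", xlabel="words", ylabel="tweets")
    fig.savefig(out_path)
    plt.close(fig)
    return char_counts, word_counts
```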
One problem encountered was noise: stop words such as "was" or "is" dominated the results. We used NLTK (a Python natural language processing library) to filter these out.
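The filtering itself is a one-line comprehension. NLTK's full list comes from `nltk.corpus.stopwords.words("english")` (after a one-time `nltk.download("stopwords")`); the short hand-picked set below is a stand-in so the sketch runs without that download:

```python
# Stand-in stopword set; the project used NLTK's much fuller English list.
STOPWORDS = {"a", "an", "the", "is", "was", "are", "were", "of", "to", "and", "in"}

def filter_stopwords(words):
    """Drop stop words (case-insensitively) before counting word frequencies."""
    return [w for w in words if w.lower() not in STOPWORDS]
```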
The final step was to plot the latitude and longitudes of the tweets, and superimpose them upon a map of new COVID cases. We used the CDC’s publicly available COVID dataset. We did this using the Python library plotly. See the graphs below.
When analyzing and visualizing the vehicle review data, the same essential steps were performed, with a few modifications. One modification addressed an issue we hit: the most frequently occurring words did not provide much information on their own; they lacked context. We used the N-grams technique, which groups each word with the words that most commonly appear alongside it, letting us recover the context of the top results. For example, "mileage" by itself doesn't tell you much; "great mileage" or "awful mileage" is far more useful.
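N-gram counting fits in a few lines with `collections.Counter`; the function name and defaults here are illustrative:

```python
from collections import Counter

def top_ngrams(texts, n=2, k=5):
    """Return the k most common n-grams (as space-joined strings) in texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # Slide a window of size n across each text
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)
```

With n=2 this turns a bare top word like "mileage" into context-bearing pairs like "great mileage" versus "awful mileage".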
With the first use case, interpreting COVID tweets, we were successfully able to process this data, and visualize it in certain ways. We created a histogram of word frequency, a bar graph of the most frequent ‘positive’ words (more on this later), and a map of tweet locations superimposed upon COVID case frequency.
Some interesting analysis is possible from this data.
First, VADER is a widely known and commonly accepted sentiment analysis tool, generally trusted to accurately identify positive and negative sentiment in given source data. However, COVID appears to break it. The most common words in the “positive” tweets are “new” and “positive”, which, in the context of a pandemic, are not actually good things!
Secondly, the majority of the tweets we detected came from New England and the Midwest – not from the epicenters we see in Texas and California. These tweets came from 8/23/2020 - 8/29/2020, and were the tweets that included geographic location (a very small subset of overall tweets).
Map of tweet coordinates superimposed on new COVID-19 cases choropleth map.
The results for use case two, analyzing vehicle reviews, were even better than the COVID use case. We were able to determine with great accuracy the features that customers care about the most in their vehicles – and it will come as no surprise that mileage tops the list.
In order to assess the context within which given results occurred, we used N-grams to map the surrounding sentiment from any given positive or negative result. This was step one.
As you can see, this creates a little bit of repetition – for example: great gas mileage | good gas mileage | better gas mileage.
We then mapped these N-grams into the buckets you can see below:
As you can see, mileage is clearly the most important, but perhaps it is interesting to automakers that fun ranks above comfort.
We performed the same process for the negative reviews, and came up with these results:
As you can see, mileage, drivability, and comfort are still the top three – but what’s interesting is result #4. A negative dealership experience can severely impact your brand. Notice how it doesn’t get anywhere near the top 10 in the positive results!
By the end of the project, we had successfully built a multipurpose Sentiment Analyzer, one that could plug into varied data sets and successfully display the hidden “intent” within.
In conclusion, my experience with Predii was amazing. I learned a lot about NLP, object-oriented programming, and data structures thanks to Mr. Ho and Mr. Dalal, who were gracious enough to work with me. I still have plenty to explore in the fields of programming, data science, and machine learning, but this was an amazing introduction.