This blog is a quick business walkthrough of a technical experiment our former intern, Nitya Kasturi, performed earlier this year with the support of our engineering team. How can you free up the time of your domain experts with automation?
Problem Statement
Every organization has a set of specialized employees who truly understand its product and how it performs in the real world. The time of these employees is incredibly valuable, yet because their knowledge is so niche, it frequently gets spent on highly manual tasks that no one else can do.
In the automotive world, cars are typically bucketed together with similar cars to create a common framework from which performance, price, features, cost of ownership and other metrics can be driven. The industry uses this for various reasons, including guiding customer journeys — "What kind of sedan should I buy? Which sub-compacts get the best mileage? What midsize SUV has the highest safety rating?"
These groupings go far beyond guiding the customer, however: understanding these groups is essential to any kind of quantitative decision making in the automotive industry. To make meaningful data-driven decisions, you need to be making apples-to-apples comparisons, whether those decisions influence sales, quality engineering, servicing, or other business functions. For example, you wouldn't assess a subcompact by its towing capacity, but if your data is mixed with inputs from pickups, that is just one more layer of noise the experts have to filter out before they can begin making decisions.
This, at a high level, might seem simple: just put the similar cars together, right? But when you consider the sheer size and variety of the data flowing in the automotive ecosystem – sales, servicing, quality engineering, multiple ERPs and CRMs from dealerships, parts suppliers, OBDII diagnostics, warranty, insurance, and much, much more – simply establishing a common framework from which to work soaks up a gargantuan amount of domain experts’ time. An automated way of ingesting data from any source and classifying it properly is thus absolutely necessary.
This is a walk-through of a machine learning research project that is not currently commercially deployed. It demonstrates some of the problems faced within the automotive industry and how to address them with automated solutions. We will explore the idea of grouping cars based on features such as engine, body, make, and model.
Car (Vehicle) Platform Groups
One of the challenges with applying AI/ML to the automotive vertical is that certain vehicles simply have scarce data. There are 20,000 varieties of Year/Make/Model/Engine on the road today, and not all of them will be well represented in the data source you are analyzing. We also see that vehicles with different models, or even different makes, may share the same engine or transmission system. Transferring what we learn across these similar vehicles is therefore crucial to worker efficiency.
To take advantage of this commonality, we create buckets where “similar” vehicles are grouped together so that we can combine their data sets to overcome the challenge of data scarcity. The definition of “similarity” came from OEMs and SMEs (Subject Matter Experts) who are familiar with car platforms and generations.
The objective for us was to build a machine learning model based on the training data set, such that:
We can auto-classify unclassified vehicles and new vehicles to one of the existing buckets with high accuracy, replacing the manual process
We can find gaps and errors in the manual classification of vehicles
We used the labelled data sets to train a classification model, then tested that model to determine accuracy.
Data Sets
The total training data set consisted of 99,000+ vehicles classified by a human expert into 1000+ groups. Each of the vehicles has the following primary attributes and features:
Year of Manufacture (for example, made in 2017)
Make of the Vehicle (for example, a 2017 Honda)
Vehicle Model (for example, a 2017 Honda Civic)
Vehicle Sub-Model (for example, a 2017 Honda Civic Touring Coupe)
Body Type (for example, the Coupe in a 2017 Honda Civic Touring Coupe)
Engine (for example, a 2017 Honda Civic Touring Coupe 1.5L 4-Cylinder Turbo Engine)
Classification categories (SME Manually Annotated – painful!)
Engine Group: based on the engine type
Body Group: based on the body type
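To make the shape of this data concrete, here is a minimal sketch of how a single labelled record might be represented, assuming hypothetical column names and group labels (the actual schema and label values used in the experiment may differ).

```python
import pandas as pd

# Hypothetical schema for the labelled training data. The column names and
# the group labels below are illustrative, not the ones used in the project.
vehicles = pd.DataFrame([
    {
        "year": 2017,
        "make": "Honda",
        "model": "Civic",
        "sub_model": "Touring Coupe",
        "body_type": "Coupe",
        "engine": "1.5L 4-Cylinder Turbo",
        # SME-provided classification targets
        "engine_group": "HONDA_1.5L_L4_TURBO",
        "body_group": "CIVIC_COUPE_10TH_GEN",
    },
])
```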
Processing Approach
A decision tree is a supervised learning algorithm. A supervised learning algorithm takes in inputs described by various features and requires a label for each input to train on. In our use case, the features were extracted from the six vehicle attributes listed above, and the labels we used for this experiment were the engine buckets defined by the SMEs.
A decision tree is excellent for scenarios like this, where feature values are very distinct and the feature values themselves are highly informative for determining groups. However, decision trees tend to overfit by creating overly complex trees that do not generalize well beyond the training data. We used the CART (Classification and Regression Trees) implementation of the decision tree. Our decision tree was trained and tested on vehicles manufactured from 1995 to 2017 (120,536 Y/M/M/E varieties), and its accuracy on that 1995-2017 data was 79%. We then evaluated the model's ability to auto-classify in a scenario similar to how an SME would employ it in the real world: looking only at new incoming data, from cars manufactured in 2018. The model trained on the 1995-2017 data classified the 2018 data with 97% accuracy.
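As a rough illustration, the sketch below shows how the year-based split and CART training might look with scikit-learn, whose DecisionTreeClassifier is an optimized CART implementation. It assumes a `vehicles` DataFrame shaped like the hypothetical record above but covering the full labelled 1995-2018 set; this is not the team's actual code.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score

FEATURES = ["year", "make", "model", "sub_model", "body_type", "engine"]
TARGET = "engine_group"

# Mirror the experiment's split: train on 1995-2017, then classify the
# "new incoming" 2018 vehicles.
train = vehicles[vehicles["year"].between(1995, 2017)]
test = vehicles[vehicles["year"] == 2018]

# Decision trees need numeric inputs, so encode the categorical attributes;
# categories never seen in training map to a sentinel value.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train = encoder.fit_transform(train[FEATURES])
X_test = encoder.transform(test[FEATURES])

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, train[TARGET])

# Compare predictions for 2018 against the manually verified groups.
predictions = tree.predict(X_test)
print("2018 accuracy:", accuracy_score(test[TARGET], predictions))
```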
There are plenty of algorithms available for classification tasks like the one we cover in this blog; some examples are SVMs (Support Vector Machines), Naive Bayes classifiers, and logistic regression. We found that decision trees were more effective at classification when working with a large number of classes (here, the vehicle groups), especially when the number of data samples per class was small.
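A comparison along those lines could be sketched as below, fitting each alternative under the same year-based split (again using the hypothetical data above, not the team's actual benchmark setup). One-hot encoding is used here because it suits linear models and Naive Bayes better than ordinal codes.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Each candidate gets the same one-hot encoding of the raw attributes so the
# comparison is apples-to-apples.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "linear_svm": LinearSVC(),
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, clf in candidates.items():
    model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), clf)
    model.fit(train[FEATURES], train[TARGET])
    print(f"{name}: {model.score(test[FEATURES], test[TARGET]):.1%}")
```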
Results
When compared to the manually labeled data encompassing the entire range of vehicles from 1995 to 2017, our algorithm was able to recreate the same classifications with 79% accuracy. This is due to the huge variety and range of vehicles: some vehicles are rarer than others and lack the support in the data for quality algorithmic classification. Further, the labelled data sets themselves are imperfect, since some of the decisions made by SMEs are based on intuition.
When the same processing algorithm was applied to unlabeled data sets from just the year 2018, it produced classifications that were 97% accurate. Both accuracy percentages were verified manually.
What does this mean in the real world? It means that the time, effort, and cost it takes to implement any kind of data-driven product or workflow improvement is drastically reduced. It means that domain experts can spend less time making sense of the noise and more time using data to make decisions. This is the kind of worker augmentation that AI is meant for!