ScoreData announces Promotion for HPE Customers

July 26th, 2016

Promo Title: HPE Big Data Marketplace Promotion

Promo Description: ScoreData Corporation is pleased to offer HPE customers 3 months of ScoreFast license at no charge, based on trailing six months of data. This will be extremely useful as you build your Big Data Analytics solutions.

This offer expires on December 31, 2016.

For more information, please write to us at

Predictive Analytics for the Intelligent Customer Engagement Center

July 20th, 2016


The ScoreData Team

A Customer Engagement center is a central point from which all customer contacts of an enterprise, including voice calls, email, social media, faxes, letters, etc., are managed.  It is part of a company’s customer relationship management.  With the large amounts of data collected by modern engagement centers, there are many opportunities to apply predictive analytics to improve engagement center efficiency and customer satisfaction.  The applications can be broadly classified into the following categories:

  • Enhancing Customer Engagement and improving customer experience. The possible applications include:
    • Improved caller/agent matching
    • Enhanced cross-selling and upselling
    • Superior customer retention
    • Interactive Voice Response (IVR) analytics
    • Behavioral targeting to better serve customers
  • Optimizing contact center management and control. Possible applications include:
    • Agent ranking and performance measurement
    • Improved call routing and distribution
    • Centralized global queueing
    • Staffing optimization

ScoreData has extensive experience in building predictive models using its ScoreFast™ engine, many of which can be integrated with caller-business engagement scenarios, e.g., improved customer retention by churn prediction and mitigation, enhanced cross-selling and upselling, risk analytics, etc.  This paper describes a staffing optimization project undertaken in collaboration with our partner Avaya.  We then go on to examine how predictive analytics will help contact centers of the future.

Customer Engagement centers employ a large number of contract agents as the workloads tend to vary significantly over the course of a year.  Being able to predict workload and staffing requirements to keep the customer wait times less than a specified threshold is critical for agent capacity planning.

Engagement Center Optimization Example

Business Objective: To build a workload prediction system that accurately forecasts call volumes and agent capacity requirements for the following week.  While forecasting is a common use case in many verticals, as the data, the number of variables, and the requirements for finer levels of forecasting increase, the modeling problem becomes more challenging.  Avaya provided the data for this project from a demonstration system using historical data sets.

Methodology: We used the following methodology for the forecasting project:

  1. Data preparation, audit and understanding – We engaged with Avaya teams to grasp the business context and understand the data (tables, fields and their meanings).  We normalized and merged all datasets to form a single view of the data for modeling.
  2. Feature Engineering and Data Loading – ScoreData created information rich features from the single view dataset. The resultant dataset was loaded into the ScoreData analytics platform.
  3. Feature Extraction and Model Development – We conducted statistical analyses on the ScoreData platform to generate the best predictors of the target variables. Then we developed various models for forecasting workload and staffing.
  4. Insights and Report Preparation – Analyses outcomes and insights were collated in a report along with model characteristics.

Data Audit and Insights:

  • Looking at per-day level call volume data, one can see two flat periods (with no activity) in the data (‘call volume vs. days’ graph below).  As a consequence, ScoreData did not treat this as a homogeneous time series (from 6/11/13 to 7/22/13); instead, we used only the data for the two time periods that have call volumes, ignoring the flat periods.

  • Call Segment Data Funnel – Step by step data stats on applying filters to remove Interactive Voice Response (IVR) calls and data stats:
  1. Total number of segments (or calls): 9,242,604
  2. After removing IVR calls, which we cannot use: 2,107,867
  3. Removed several more rows that did not meet Avaya’s criteria for including in the analysis
  • Average Number of agents handling calls for each hour

  • Average Queue Group Time  (Ring  + Talk + After call)

  • Average Queue Group Time (without the Talk time)

  • Time Distributions

  • Disposition Time of Agents
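The filtering described in the audit above (dropping IVR segments, then ignoring days with no call activity) can be sketched as follows. This is a minimal illustration only: the record layout, the `"ivr"` and `"agent"` labels, and the sample rows are hypothetical and are not taken from the Avaya data.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical call-segment records: (call date, segment type).
segments = [
    (date(2013, 6, 11), "agent"),
    (date(2013, 6, 11), "ivr"),
    (date(2013, 6, 12), "agent"),
    (date(2013, 6, 12), "agent"),
    (date(2013, 6, 14), "ivr"),
    (date(2013, 6, 15), "agent"),
]

def drop_ivr(rows):
    """Keep only the segments that are not IVR calls."""
    return [row for row in rows if row[1] != "ivr"]

def daily_volume(rows, start, end):
    """Count calls for every day from start to end, including days with zero calls."""
    counts = Counter(day for day, _ in rows)
    n_days = (end - start).days + 1
    return {start + timedelta(days=i): counts.get(start + timedelta(days=i), 0)
            for i in range(n_days)}

def active_days(volume):
    """Return the days that have at least one call, i.e. skip the flat periods."""
    return sorted(day for day, count in volume.items() if count > 0)

kept = drop_ivr(segments)
volume = daily_volume(kept, date(2013, 6, 11), date(2013, 6, 15))
modeling_days = active_days(volume)
```

Only the days in `modeling_days` would then feed the forecasting models, so the flat periods do not distort the series.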

Models: Since the call volumes varied by day, prediction models were built for each day of the week.  We used the Generalized Linear Model (GLM) algorithm with Poisson distribution to predict the number of agents required for a given call volume, wait time, day of week and queue group.  The following figure shows the features with the highest relative importance.  The most important feature turns out to be the Number of Calls, with wait time coming in next; there is a negative dependence on wait time (highlighted in orange).  The greater the wait time allowed, the fewer the agents required.
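To make the modeling step concrete, here is a minimal sketch of a Poisson GLM with a log link, fitted by iteratively reweighted least squares (a standard fitting method, not necessarily the one ScoreFast™ uses). The data points are made up for illustration, and the sketch uses only call volume and allowed wait time as features; the day-of-week and queue-group inputs mentioned above are omitted for brevity.

```python
import math

def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_poisson_glm(rows, targets, iterations=25):
    """Fit log(E[y]) = x . beta by iteratively reweighted least squares."""
    k = len(rows[0])
    mean_y = sum(targets) / len(targets)
    mu = [(y + mean_y) / 2 for y in targets]  # positive starting values
    beta = [0.0] * k
    for _ in range(iterations):
        eta = [math.log(m) for m in mu]
        # Working response; the weights for the log link are mu itself.
        z = [e + (y - m) / m for e, y, m in zip(eta, targets, mu)]
        xtwx = [[sum(m * x[i] * x[j] for x, m in zip(rows, mu)) for j in range(k)]
                for i in range(k)]
        xtwz = [sum(m * x[i] * zi for x, m, zi in zip(rows, mu, z)) for i in range(k)]
        beta = solve(xtwx, xtwz)
        mu = [math.exp(sum(b * xi for b, xi in zip(beta, x))) for x in rows]
    return beta

# Hypothetical observations: (calls in the interval, allowed wait in seconds, agents).
data = [(100, 30, 3), (100, 60, 2), (100, 90, 2), (200, 30, 4), (200, 60, 3), (200, 90, 2),
        (300, 30, 7), (300, 60, 5), (300, 90, 4), (400, 30, 10), (400, 60, 7), (400, 90, 5)]
X = [[1.0, calls / 100.0, wait / 60.0] for calls, wait, _ in data]
y = [agents for _, _, agents in data]
beta = fit_poisson_glm(X, y)

def predict_agents(calls, wait):
    """Predicted number of agents for a call volume and allowed wait time."""
    return math.exp(beta[0] + beta[1] * calls / 100.0 + beta[2] * wait / 60.0)
```

On data shaped like this, the fitted coefficient on call volume is positive and the coefficient on wait time is negative, which mirrors the dependence described above: the greater the wait time allowed, the fewer the agents required.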

The project was useful to the ScoreData team to understand the nature of call center data and to create a unified view of the data. Forecast of call volumes was done on a daily basis.  The next step in the project, given enough data, is to forecast call volumes on an hourly basis and then predict staffing requirements also on an hourly basis.

Analytics for the Engagement Centers of the Future:

In this section, we will examine a few emerging trends in the contact centers of the future and discuss how analytics will play a crucial role in the transformation of such centers.  The engagement centers of the future must be agile enough to adapt quickly as customers’ expectations shift with advances in and varieties of interaction options open to them.  Analytics will play a most important role as contact centers adapt to the changing demands of the future.  According to Laurent Philonenko, SVP of Corporate Strategy & Development, CTO of Avaya, most businesses that are making analytics an urgent investment are doing so to be better positioned to (1) compete more successfully, and (2) grow their business to increase revenue potential.

Virtualized Engagement Centers: The engagement center of the future is not likely to be a centralized facility, but will be distributed geographically with agents working out of their homes.  High turnover in the engagement center staff is causing business leaders to look for effective ways to attract and retain the best talent.  Flexibility in working hours and workplace is an important factor in this.  Virtualization technology will play a key role in making this practicable.  Key issues here are security and privacy of customer data.  Machine Learning and Analytics are increasingly playing a critical role in detecting security breaches and avoiding future attacks.  Analytics techniques are now being applied across the network, application and data layers to provide increased security and privacy.  One of the techniques used to provide privacy is data encryption.  Ability to process encrypted data and draw insights will become increasingly important.

Future Agents: It turns out that today contact center staff accounts for almost 75% of the cost of running a center.  Therefore, it is important to optimize agent performance with the right tools and information.  Agents will increasingly have to multitask among voice calls, social, chat and email interactions with the customers.  Today, most agents have to toggle between a set of standalone software applications to access the information they need to service customers.  Agents will need a seamless view of customer information to fluidly meet customer needs.  With the digital transformation of an enterprise, the data silos that are entrenched in today’s enterprises can be overcome.  Again according to Philonenko, in a digital enterprise, with a single analytics application it becomes much easier to have a single view into the entire journey of customer data, partner data, employee data, process data, etc.

Instead of specific agent skill groups that today’s contact centers employ, future centers will have a fluid workforce in terms of skills.  For example, a particular agent may have foreign language skills as well as the ability to cross-sell or up-sell effectively.  Agents will not belong to specific skill groups, but will be called upon to service customers based on the needs and demand.  Analytics will play an important role in forecasting the workload, optimizing the staff, and routing customer calls to the right agents to minimize the customer waiting times.  Optimal workload forecasting is an extremely challenging problem under those constraints.  ScoreData has worked out an approach to solving this challenging problem with predictive analytics.  We hope that will be the subject of another report in the future.

The way agent performance is measured and ranked will also need to change to keep up with customer demands.  Contact centers will need to transition from a reliance on efficiency-based metrics, such as Average Handling Time (AHT) and calls handled per hour, to customer and business focused measures like First Call Resolution (FCR), customer satisfaction, and ROI.  Analytics can be leveraged in that transition as well.  For example, customer feedback surveys – both ratings and comments – can be analyzed to assess customer satisfaction.

Future Customers: According to the American Express 2011 Global Service Barometer, U.S. consumers prefer to resolve their service issues using a variety of touch-points, including the telephone (90%), face to face (75%), company website or email (67%), online chat (47%), text message (22%), social networking site (22%), and using an automated response system (20%).  And according to the 2014 Global Service Barometer, for simple issues consumers prefer going online (36% versus 14% by phone) and for difficult enquiries talking to an agent by phone (48% versus 10% by email).  Consumer preferences will keep changing.  When they escalate a service issue from chat to voice, they expect the new agent to know the interaction that has already happened.  Cloud-based contact center and analytics can help agents to follow a customer’s journey seamlessly.

How ScoreFast™ Makes a difference

The entire project with the run-time engine was delivered in six weeks.  Our models were built using the ScoreFast™ engine, the web-based model development and management platform.  It is built as an enterprise grade modeling system that can be used to develop a broad range of models for use cases inside and outside the Engagement Center.

Its data and model management modules are easy to use, dashboard driven and intuitive. It chooses models for specific use cases after trying out many algorithms internally and selecting the one with best performance metrics. ScoreFast has collaboration features encouraging sharing and collaboration within large cross functional teams, and access control features that are designed keeping in mind specific needs of ScoreData’s big enterprise clients. The collaboration features encourage cross-functional knowledge sharing and innovation within the companies.

The platform has built in hooks to link raw data feeds into the system and its one push provisioning features mean models once developed and tested can be deployed onto downstream systems with a single push of a button. These features make ScoreFast easy to integrate into existing business processes without any disruption or cost overheads.  

The canonical engagement center use cases deal with Agent Ranking, Caller-Agent Mapping, Dashboards with cross-sell or upsell with detailed presentation of customer profiles, and Engagement Center workload optimization.  All these yield substantial improvements in customer satisfaction and improved top-line and bottom-line benefits.  The ScoreFast™ engine delivers unique value throughout the predictive analytics insights-to-decision process, with a dramatically lower total cost of ownership.

Conclusion: Consumer demands will continue to change and technology will continue to evolve.  Predictive Analytics will play a crucial role in helping businesses to adapt to the changing world.  Engagement centers that adapt to changes to empower agents while keeping customer satisfaction in mind will become “relationship centers.”  As Philonenko says, analytics allow human beings to be smarter, act faster, evolve and grow, all of which are essential for an agile relationship center.

ScoreFast™: Predicting hospital readmission rates for diabetic patients

June 16th, 2016

I am a biologist by training and new to Machine Learning (ML).  As a member of the ScoreData team, I wanted to take the opportunity to try to analyze bioscience data on ScoreData’s ML platform (ScoreFast™).  This project allowed me to use my skills as a biologist and my passion to learn new technologies and concepts. After searching for a public data set to experiment with, I decided to use Diabetic 130-US hospitals (1999-2008). This data set has a relatively large size (about 100,000 instances, 55 features).  Also, it has a published paper  linked to the data.

ScoreFast™, the web based model development and management platform from ScoreData, is built as an enterprise grade machine learning platform.  Its data and model management modules are dashboard driven, and intuitively easy to use.  ScoreFast™ supports many algorithms and with its in-memory computation, it makes it easy to build and test models quickly.  As someone new to machine learning, the easy interface to build models made it easy for me to get started.

The goals of this project were to:

(1) Use ScoreFast™ platform to find correlations between data features and readmission of patients to hospitals within 30 days,

(2) Compare results from the ScoreFast™ platform with other platforms such as IBM’s Watson, Amazon’s ML platform and with the published paper, and

(3) Build a predictive model on ScoreFast™ platform.

First, I read the paper to understand a bit more about the data.  In the paper, the authors hypothesized that measurement of HbA1c (The Hemoglobin A1c test, HbA1c, is an important blood test that shows how well diabetes is being controlled) is associated with a reduction in readmission rates in individuals admitted to the hospital.  The authors took a large dataset of about 74 million unique encounters corresponding to 17 million unique patients and extracted data down to 101,766 encounters.  The authors further cleaned up the data to avoid bias: (1) Removed all patient encounters that resulted in either discharge to a hospice or patient death. (2) Considered only the first encounter for each patient as primary admission and determined whether or not they were readmitted within 30 days.  The final data set consists of 69,984 encounters. This is the dataset we used to do data analysis and to build a predictive model.
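The two cleaning rules can be sketched in a few lines. This is an illustration under stated assumptions: the field names, the `"hospice"` and `"expired"` labels, and the sample records are hypothetical, and encounter IDs are assumed to increase over time so that the lowest ID per patient is the first encounter.

```python
# Hypothetical encounter records; the real data set uses coded fields.
encounters = [
    {"encounter_id": 1, "patient": "A", "discharge": "home", "readmitted": "<30"},
    {"encounter_id": 2, "patient": "A", "discharge": "home", "readmitted": "NO"},
    {"encounter_id": 3, "patient": "B", "discharge": "hospice", "readmitted": "NO"},
    {"encounter_id": 4, "patient": "C", "discharge": "expired", "readmitted": "NO"},
    {"encounter_id": 5, "patient": "D", "discharge": "home", "readmitted": ">30"},
]

EXCLUDED_DISCHARGES = {"hospice", "expired"}

def clean(rows):
    """Drop hospice/death discharges, then keep each patient's first encounter."""
    kept = [r for r in rows if r["discharge"] not in EXCLUDED_DISCHARGES]
    first_by_patient = {}
    for r in sorted(kept, key=lambda r: r["encounter_id"]):
        first_by_patient.setdefault(r["patient"], r)
    return list(first_by_patient.values())

def label(row):
    """1 if the patient was readmitted within 30 days, else 0."""
    return 1 if row["readmitted"] == "<30" else 0

cleaned = clean(encounters)
labels = [label(r) for r in cleaned]
```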

A few key details about the data set:  1) Although all patients were diabetic, only 8.2% of the patients had diabetes as primary diagnosis (Table 3 of paper), 2) HbA1c test was done for only 18.4% of the patients during the hospital encounter (Table 3 of paper).

We used the same criteria, as mentioned in the paper, to pare down the data to 69,984 encounters.  To reduce bias, I did not identify the parameters that were found to be significant in the paper.  I wanted to see which parameters the ScoreFast™ machine learning platform picked up to be significant to the hospital readmission rate and how well they compared to the analysis in the paper and also to other ML platforms.


Data Analysis: Understanding the data

The data set had 55 features.  After ignoring some of the features, such as Encounter_ID, Patient_No, Weight (97% of the data was missing), and Payer_Code, which were either irrelevant to the response (readmission to hospital) or had poor data quality, the total number of relevant features was 46.  The first step was to do a data quality check on the 69,984 records (46 features) to understand the quality of data. Out of 46 features, 23 were medicine related, and of those, 22 were diabetic medications and one was cardiovascular.  The data had 17 features that had constant values.  The diagnosis features (diag_1, diag_2, diag_3) had icd9  (the International Classification of Diseases, Ninth revision; standard list of 6-character alphanumeric codes to describe diagnoses) codes.  These codes were transcribed to their corresponding descriptions (Table 2 of paper).  Diag_1 was renamed to ic9Groupname, diag_2 renamed ic9Groupname2 and diag_3 to ic9Groupname3.  By doing this aggregation, the correlation between any disease/procedure and hospital readmission was easier to see.
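The transcription of diagnosis codes into group names can be sketched as below. The ranges used here are the standard ICD-9 chapter ranges and are only an approximation of the grouping in Table 2 of the paper, which should be consulted for the exact definitions; the function name is hypothetical.

```python
# Standard ICD-9 chapter ranges (inclusive); an approximation of the paper's grouping.
ICD9_GROUPS = [
    ("Neoplasms", 140, 239),
    ("Circulatory", 390, 459),
    ("Respiratory", 460, 519),
    ("Digestive", 520, 579),
    ("Genitourinary", 580, 629),
    ("Musculoskeletal", 710, 739),
    ("Injury", 800, 999),
]

def icd9_group(code):
    """Map a raw diagnosis code such as '250.83' or 'V57' to a coarse group name."""
    if code.startswith(("V", "E")):
        return "Other"  # supplementary V and E codes
    try:
        number = int(float(code))
    except (ValueError, OverflowError):
        return "Other"  # missing or malformed values such as '?'
    if number == 250:
        return "Diabetes"
    for name, low, high in ICD9_GROUPS:
        if low <= number <= high:
            return name
    return "Other"
```

For example, `icd9_group("428")` falls in the Circulatory group and `icd9_group("250.83")` in the Diabetes group, so many distinct codes collapse into a handful of categories that are easier to relate to readmission.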

When I used the ScoreFast™ platform to build the models, I got AUC (Area Under the Curve) values in the range of 0.61-0.65 (Table B, below), which indicated that the data carried only a weak predictive signal for readmission and was not of very good quality.  This was consistent with both IBM Watson, which gave a data quality score of 54, and the Amazon ML platform, which also had an AUC of 0.63 (Table A).
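For readers new to the metric, AUC can be read as the probability that a randomly chosen readmitted patient receives a higher score than a randomly chosen non-readmitted one, so 0.5 is no better than chance and 1.0 is a perfect ranking. A minimal sketch of that definition, with made-up labels and scores:

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = readmitted) and model scores.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Here three of the four positive/negative pairs are ordered correctly, giving 0.75. Values of 0.61-0.65 therefore sit much closer to chance than to a perfect ranking.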

Building a predictive Model on ScoreFast™ Platform

I found it easy to upload the data set onto the ScoreFast™ platform.  The platform easily splits the data into train and test sets.  The platform has options to build four different classes of models, among others: GBM, GLM, DL (Deep Learning) and DRF (Distributed Random Forest).  ScoreData plans to add more algorithms in the near future.   Without knowing the details of the algorithms, I was able to build the four models easily on the platform.   All the four models had very similar results with AUC ranging from 0.61-0.65. 
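The platform performs the train/test split itself; for readers who want to see what such a split amounts to, here is a minimal sketch. The function name, the 80/20 ratio, and the fixed seed are assumptions for illustration, not a description of how ScoreFast™ does it.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for encounter records
train, test = train_test_split(rows)
```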

Once the models were built, I could click on the “Detail” links to learn more about each model: the MSE, ROC, thresholds and the key features.  The top ten features with significant correlation with hospital readmission rates were similar on GBM and DRF models (Table C, below).  I was intrigued to see that the top ten features with significant correlation with hospital readmission rate were different on the GLM model and DL model (Table D, below).  After discussions with my colleagues, I understood the sensitivity differences between the various algorithms.  I realized that one can use different models depending on what one is looking for and the data available.

Comparison with the Amazon and IBM Machine Learning Platform

The IBM Watson platform, showed the following three parameters to have significant correlation with the hospital readmission rate: Discharge disposition (where the patient is discharged to), the number of inpatient visits to hospital, and time spent at each hospital visit. Their interaction graphs are very useful to understand visually.  When the data was analyzed on ScoreFast™, the three significant features on IBM platform were also part of the top ten significant features on the GBM and DRF models of ScoreFast™ (Table C for ScoreFast™ platform and Table E on IBM’s Watson).

The Amazon ML platform did not provide a way to visualize the top features of the model. It does provide the cross-validation result: 91% of predictions were correct, for an error rate of 9%. The number of true positives (TP) was very low. This was again validated in the confusion matrix results on the ScoreFast™ platform, as shown in Table F.

The top features, model accuracy and results on ScoreFast™ platform were very similar when compared with IBM Watson and Amazon ML platform.


Predicting readmission rates for new patients

With the push of a button, the models can be deployed in a production environment in the ScoreFast™ platform. 14,462 (20% of original data) rows were used to test the models. Using the batch prediction interface on ScoreFast™, I tested the hospital readmission models and the results are shown below (Table G).

The model was able to correctly predict the 92% of the population that did not require readmission (True Negative). Of the remaining 8%, for the GBM model, 9 out of 1299 were correctly predicted (TP), but the number of False Negatives was high (1290).  Deep learning had a slightly better True Positive rate (19 out of 1299).  The DRF model did not give good True Positive predictions. The low TP rate was expected, as we had observed it while building the models.
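To put the true positive counts in perspective, the recall (the share of actual readmissions that the model catches) follows directly from the figures quoted above. A small worked calculation:

```python
def recall(true_positives, false_negatives):
    """Fraction of actual positives that the model identified."""
    return true_positives / (true_positives + false_negatives)

gbm_recall = recall(9, 1290)       # GBM: 9 of the 1299 readmitted patients
dl_recall = recall(19, 1299 - 19)  # Deep Learning: 19 of the 1299

gbm_percent = round(gbm_recall * 100, 2)
dl_percent = round(dl_recall * 100, 2)
```

That works out to roughly 0.69% for GBM and 1.46% for Deep Learning, so high overall accuracy here comes almost entirely from the majority class.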

Table H (below) shows how the threshold impacts the accuracy of the model (GBM).  Maximum accuracy is obtained with a threshold of 0.644. If the threshold is tuned to increase the TP, it starts to impact TN. In this case, TN (a patient who should not be readmitted is not readmitted) and TP (a patient who should be readmitted is readmitted) are both important.  In general, the threshold can be used to tune the model for the desired result.
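The trade-off between TP and TN can be seen in a small sketch that recomputes the confusion matrix at different thresholds. The labels and scores below are invented for illustration and are unrelated to Table H.

```python
def confusion(labels, scores, threshold):
    """Return (TP, FP, TN, FN) when scores at or above the threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical labels (1 = readmitted) and model scores.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.90]

at_half = confusion(labels, scores, 0.5)      # 2 TP, 6 TN
at_quarter = confusion(labels, scores, 0.25)  # 3 TP, 4 TN
```

Lowering the threshold from 0.5 to 0.25 recovers the one missed readmission but turns two true negatives into false positives, which is the same tension described above.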

To improve the TP and/or reduce the False Negative (FN), we need more data samples as well as additional features. We had observed from the beginning that the data quality was not that great and the model had an AUC of 0.63.


As mentioned earlier, the data quality was not good.  The paper mentioned that there was a significant correlation between the HbA1c test not done during the hospital stay and the readmission rate.  The DL model on the ScoreFast™ platform also showed a correlation between HbA1c test not done and hospital readmission rate.  The DL model also picked up Circulatory diagnosis as a significant parameter which was also mentioned as a significant feature in the paper.  Again, due to the differences in the algorithms, the other models did not pick up these features.  As a next step, I need to understand the algorithms better to answer how data/features affect the choice of models for building predictive models and to understand the impact of each feature.

The significant parameters with a high correlation to hospital readmission found on the ScoreFast™ (GBM & DRF models) and in the Watson platform were comparable to the data in Table 3 of paper.  The data in the Table 3 of paper shows a correlation between the hospital readmission rate and (1) discharge disposition, i.e., where the patient is discharged to after the first visit (discharge_disposition_id), (2) primary diagnosis, and (3) age (higher chances of a person older than 60 years to be readmitted).

I found ScoreFast™ easy to use.  It was easy to load large data sets and analyze data.  The platform provides a choice of many different kinds of algorithms.  These can be used either to confirm the significant correlations between parameters or can be used for performing different kinds of analysis.  It supports model versioning which allows one to try different variations of models keeping track of the performance of each model.

What’s next? 

I would like to understand the details of the algorithms behind the different models a bit more deeply, so I can figure out how to choose different models for different purposes. As a next step, I plan to understand the data a bit more and try to see how I can improve the quality of data. One idea is to consolidate the medicines from 23 features into 2. I also want to understand the feature correlations and the importance of variables.  Also, I plan to work with my data scientist colleagues to tweak the algorithm configurations to see if I can improve the model accuracy.

Beyond this diabetic data analysis project, I plan to experiment with a few more datasets in order to gain more insight into how machine learning and ScoreFast™ can be used to get actionable insights from clinical or medical data.


I would like to thank Prasanta Behera for helping guide me through this project.

ScoreData announces Promotion for Avaya Customers

June 2nd, 2016
Promo Title: Agent Ranking Promotion
Promo Description: ScoreData Corporation is pleased to offer Avaya customers 3 months of ScoreFast license at no charge for Agent Ranking based on trailing six months of data.   This will be extremely useful as you build your caller agent mapping solutions.
This offer expires on December 31, 2016.

Churn Management for the Masses using ScoreFast™

May 10th, 2016

There are broadly three ways for a business to grow and defend its current revenue stream: by acquiring new customers, by cross or up-selling to existing customers, and by improving customer retention. All three have a cost associated with them and the businesses are interested in the ROI on their investments. Acquiring new customers may cost anywhere from five to fifteen times more than selling to an installed customer base.  For a consumer-facing business, the ability to set up robust processes that predict consumers’ propensity to churn well in advance, with enough time to run retention campaigns and stop critical consumer segments from leaving, makes sound business sense and is essential to building a robust business.

Companies have been employing machine-learning techniques on their data to find patterns that signal their customers’ propensity to churn. Historically, companies worked with analytics consulting companies that specialized in developing churn prediction models for specific industries and functions. These consulting companies used established processes to develop churn prediction scorecards for each significant consumer segment in their consumer base. Here are some example steps:

First, ETL & Data wrangling: the efficacy of predictive models is based on the datasets used for machine learning (model development).

Second, Feature engineering: the process of defining customer attributes through historical information about them, followed by identifying predictor features (attributes with highest impact on churn behavior) through statistical algorithms and numerical methods.

Third, Model development:  This is usually followed by defining time frames (input data window, output/prediction window), and model training and validation. The models and churn propensity scores thus developed are used to identify future churn propensity based on recent customer behavior.

Fourth, Deployment: These inputs are plugged into churn retention campaigns for specific customer segments.
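The time-frame definition in the model development step can be made concrete with a short sketch. It builds one training example per customer from an input (observation) window before a cutoff date and labels churn from a prediction window after it. The window lengths, the single feature, and the rule that no activity in the prediction window means churn are all assumptions chosen for illustration.

```python
from datetime import date, timedelta

def build_example(activity_dates, cutoff, input_days=90, prediction_days=30):
    """Create features from the input window and a churn label from the prediction window."""
    input_start = cutoff - timedelta(days=input_days)
    prediction_end = cutoff + timedelta(days=prediction_days)
    recent = [d for d in activity_dates if input_start <= d < cutoff]
    future = [d for d in activity_dates if cutoff <= d < prediction_end]
    return {
        "events_in_input_window": len(recent),
        "churned": int(not future),  # 1 if no activity in the prediction window
    }

cutoff = date(2016, 4, 1)
stayed = build_example([date(2016, 1, 15), date(2016, 3, 1), date(2016, 4, 10)], cutoff)
left = build_example([date(2016, 2, 1), date(2016, 2, 20)], cutoff)
```

Keeping the two windows strictly on either side of the cutoff is the important design choice: the model only ever sees information that would have been available when the prediction is made.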

In this traditional paradigm, churn prediction and management was only accessible to mid-to-large size companies, those that could build data science teams or hire analytics consulting companies. This traditional model of churn management is no longer relevant today for two reasons.

First: new-age SaaS prediction services like ours, ScoreData’s ScoreFast™, are bringing down the infrastructure investment and upfront costs substantially and abstracting away the science of churn propensity prediction, making it easier for business managers to use. All of this contributes to making churn prediction and management accessible to businesses of all sizes.

Second: Cloud, social media, IoT and ubiquitous devices are redefining consumer touch points and the competitive landscape for businesses every day. Companies today are employing smart customer engagement solutions at multiple layers. In this new world, by the time the manually developed churn prediction models get deployed, the underlying assumptions (the indicators of churn behavior) may have already changed. This means your churn prediction models are obsolete by the time they are deployed. Companies need systems that are nimble on their feet, systems that keep up with the ever-changing business landscape and keep giving superior results.

Let’s try to understand these concepts with the help of some real world examples of churn management in business.

Churn Management for Telecom

First, let’s look at the telecommunication industry. It is one of the earliest adopters of churn management solutions and among the heaviest users today. The landscape of churn prediction and management has gone through a sea change in the last couple of years in this industry on account of big data analytics.

In the Telecom industry, customers (subscribers) are known to frequently switch from one company to another and this voluntary churn has always been a critical business concern. It is a subscription based business model where the majority of revenues come from recurring monthly subscription fees from existing customers.

Although telecom companies have accumulated a lot of domain knowledge about the drivers of churn behavior, they cannot predict (and contain) churn based on these static insights. For example, new plans from competitors are a well known driver of voluntary churn. Companies offer lucrative data and voice packages for new customers but not for existing ones, frequently resulting in customers moving from one company to another to get a better plan.

But subscription plans are very dynamic in nature, where new plans are being launched every day and the whole landscape changes within a matter of months. So you cannot predict future churn based on a competitive plan landscape of today. Moreover mobile phones are no longer just telecommunication devices; subscribers’ needs have a very strong social purpose as well (due to social media, image/ media sharing etc.) and these social attributes are nowhere captured in regular telecom data sets.

The point is you cannot manage churn effectively solely based on the known reasons for churn behavior. And this is where ScoreData has dramatically improved how the business problem is addressed. What you need is a strategy that allows you to develop predictive models that quantify current churn drivers and keep up with the changing landscape of churn behavior at all times. Model performance necessarily decays over time, and you need systems that keep fine-tuning the models whenever performance decays beyond the accepted thresholds.

Churn Management for Weight Loss/Management

Let’s look at another industry, and another churn management problem. The “weight loss/management” industry has a big customer churn problem. Consumers subscribe to plans for x months and then discontinue a program even though they may still need to continue the program to experience full benefits from their program. One very important driver of this churn behavior is the difference between expectations and reality. Customers sign up with unrealistic expectations and that often results in disappointments even with modest results (moderate weight loss).

Although this is a well-established phenomenon and companies do try to manage expectations for existing customers, there are several other, more important drivers of churn behavior as well. And these drivers of churn behavior keep changing with time, location and other macro parameters.  If the business needs to incorporate hundreds of additional factors to determine which features are really causing churn, you may want to compare and contrast several models with several data sets while experimenting with new signals or external data sets. You need systems that develop churn prediction models, capture these signals from the data, and implement the monitoring on a continuous basis. Systems like these enable businesses to understand the churn behavior of their important customer segments and devise retention strategies.

ScoreFast™, the web based model development and management platform from ScoreData, is built as an enterprise grade churn management system. Its data and model management modules are easy to use, dashboard driven and intuitive. It chooses models for specific use cases after trying out hundreds of algorithms internally and selecting the one with best performance metrics. ScoreFast has collaboration features encouraging sharing and collaboration within large cross functional teams, and access control features that are designed keeping in mind the specific needs of ScoreData’s big enterprise clients. The collaboration features encourage cross functional knowledge sharing and innovation within the companies.

The platform has built-in hooks to link raw data feeds into the system, and its one-push provisioning means that models, once developed and tested, can be deployed onto downstream systems with a single push of a button. These features make ScoreFast easy to integrate into existing business processes without disruption or cost overheads. ScoreFast’s real-time self-learning module makes sure your model performance does not stay below the statistical or business validation thresholds that you set up. As soon as performance drops below the line, it triggers a retrain without any human intervention. This means your churn prediction models are always on top of their game and all relevant signals are taken into consideration while scoring a consumer for their propensity to churn.
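To make the threshold-triggered retrain idea concrete, here is a minimal sketch in Python. It is not ScoreFast's implementation: the `RetrainMonitor` class, the `retrain` callback and the 0.75 AUC threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RetrainMonitor:
    """Watches a validation metric and fires a retrain callback when it dips."""

    threshold: float                     # lowest acceptable metric value (e.g. AUC)
    retrain: Callable[[], None]          # hypothetical hook that rebuilds the model
    history: List[float] = field(default_factory=list)

    def record(self, metric: float) -> bool:
        """Store the latest metric; return True if a retrain was triggered."""
        self.history.append(metric)
        if metric < self.threshold:
            self.retrain()
            return True
        return False


# Illustrative run: the third measurement falls below the threshold.
retrain_calls = []
monitor = RetrainMonitor(threshold=0.75, retrain=lambda: retrain_calls.append("retrain"))
triggered = [monitor.record(auc) for auc in (0.82, 0.79, 0.71)]
print(triggered)  # [False, False, True]
```

In a real deployment the callback would presumably kick off a training job, and the metric would come from scoring a recent labeled sample.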

ScoreFast has features for advanced users as well: those who want to peek under the hood and customize the models. The platform is not just for the business user; it empowers the data scientist to get into the specifics of model definitions, analyze performance comparisons, and fine-tune the models.

ScoreFast is the market-leading machine learning and model management platform that is making predictive model development, and specifically churn prediction, accessible to companies regardless of their size, with identical predictive power for all. With ScoreFast’s cloud-based architecture and built-for-business-manager interaction designs, the paradigms for churn management are changing very quickly. Companies that respond to these changes and efficiently leverage future-ready platforms like ScoreFast for their churn prediction and retention strategies are going to have a substantial competitive edge in the marketplace of today and in the future.

- Mudit Chandra and the ScoreData team

Rapid Development and Deployment of Machine Learned Models

April 20th, 2016

In the past ten years, we have seen a dramatic rise in the use of machine learning techniques to build predictive models.  In the rapidly evolving Predictive Analytics tools landscape, more and more applications are using machine-learned models as part of the core business process. This trend will continue to grow. From an evolutionary perspective, according to Gartner, the landscape is moving from descriptive to prescriptive analytics.

Enterprises are being challenged with what to do with the tremendous amount of data being generated within the enterprise. Outside Silicon Valley, few data scientists are available, and not every enterprise can afford the high prices of such analysis. Even to start a project, the cost is high and value may not be quickly realized.

In order for ML technology to be used in all kinds of enterprises (and not just tech-savvy ones), a new generation of platforms and tools needs to be easy to use and reasonably priced. Platforms must make it easy for business professionals (not just data scientists) to use ML techniques to improve business outcomes. Business outcomes span the customer engagement journey, from repeat business to enhanced customer retention and customer satisfaction.

We have seen many recent announcements of Deep Learning technology used by Google, IBM, Microsoft and Facebook, and by private companies such as H2O, among others, and some of those technologies are now being open-sourced. We now have more than a dozen open-source technologies that one can leverage to get started. However, there are a few challenges that we need to address to make platforms easy to use. Higher-level abstractions need to be defined in the platform, not only for data scientists but also for business users. It would be great if my product manager could use the platform to find out what models are running on it, and could even build a model using the configuration I used and set it up for an A/B test. Why not? Yes, big tech companies have built tools to support that, but it is still a struggle for smaller departments, mid-sized companies and startups. Personally, I have experienced that challenge in large technology companies as well as in startups.

Predictive Analytics and Machine Learning are such oft-used overloaded phrases that there is a tendency to overpromise the benefits. It takes time to show value to a business and the “start small, move fast” philosophy comes in handy. So, the need to be able to start a project at a low cost is critical. Another point to note is that there is no substitute for on-line testing. Don’t use back-tested results to “overpromise” the impact to business.

The best way to approach testing is rapid iteration. Let’s look at key features required in a platform to achieve rapid development. Let me start with a cautionary note – I am not targeting tech-heavy enterprises which have big teams of data scientists, but rather enterprises which want to leverage this technology to solve their business problems and can afford a small data team to prove the worth before investing more. Cloud-based solutions from small and big companies (e.g., Amazon ML) are now available to test out new ideas at a smaller cost.

Let me start with a problem that I ran into recently in the ad tech area, and use it to discuss the features that are important for an ML platform. Even if I target “US” audiences, we invariably find that some percentage of traffic is tagged as outside the targeting criteria by reporting systems (such as DFA). Those fraudulent impressions cost real money. We need a quick way to detect them and to avoid bidding on those suspicious impression calls via the Real-Time Bidding (RTB) system. This means we need a platform that can continuously update the model every hour.

So, the ability to build models quickly, with automation, for production use at a low cost is important. Let’s look at the key features the next generation of machine learning platforms should support.

Understanding Data

Data preparation is a big effort by itself, and many tools and platforms may be used to process, clean, augment, and wrangle the data. Right now, 60-80% of the time is spent on data preparation. For the sake of this blog, we are assuming that data has been prepared to build the models (okay, I will come back and do a post later on data wrangling, hopefully :-) ).

There are a couple of things ML platforms can provide for additional insight from data, such as understanding the “goodness of data” for a good model fit. A data set with mostly constant data is not good: even if it is complete, it will be hard to build a good model from it. Simple statistical properties like outlier bounds, skew, variance, correlation and histograms can be easily computed. However, platforms should go to the next step, i.e., provide a “data quality scorecard” at the feature level and overall. But what does a score of 86 mean? Is it good or bad? That’s where additional insights and recommendations can help. The platform can show the score “compared to” other similar data sets or to a configured, well-known feature. The system can be trained to provide that score; even better, a model can be built to generate the quality scorecard.
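As a rough illustration of a feature-level and overall scorecard, here is a small Python sketch. The scoring formula (completeness multiplied by a variability term that penalizes mostly constant columns) is my own assumption for the sake of the example, not a standard and not how any particular platform computes its score; `feature_quality` and `scorecard` are hypothetical names.

```python
from collections import Counter


def feature_quality(values):
    """Score one column from 0 to 100; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    completeness = len(present) / len(values)
    # Share of the single most frequent value: 1.0 means the column is constant.
    top_share = Counter(present).most_common(1)[0][1] / len(present)
    # Assumed penalty: full marks once the top value covers half the rows or less.
    variability = min(1.0, 2.0 * (1.0 - top_share))
    return round(100.0 * completeness * variability, 1)


def scorecard(table):
    """Return (per-feature scores, overall score) for a dict of columns."""
    per_feature = {name: feature_quality(col) for name, col in table.items()}
    overall = sum(per_feature.values()) / len(per_feature)
    return per_feature, overall


table = {
    "age": [34, 51, None, 29, 42, 38, 47, None, 55, 31],   # varied, 20% missing
    "country": ["US"] * 9 + ["CA"],                        # complete, mostly constant
}
per_feature, overall = scorecard(table)
print(per_feature, overall)  # {'age': 80.0, 'country': 20.0} 50.0
```

The complete but nearly constant `country` column scores far lower than the incomplete but varied `age` column, which is the point made above. Interpreting a raw number still needs a comparison baseline.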

When one is dealing with hundreds of features, it is quite hard to review data properties, so a recommendation or hint can go a long way toward understanding the data and making sure highly correlated or dependent variables are excluded from the model. (Note: Highly correlated variables will be removed in the feature reduction process.)
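One simple way to drop highly correlated variables is a greedy filter on pairwise Pearson correlation, sketched below in plain Python. The 0.95 threshold, the function names and the sample columns are illustrative assumptions; real feature reduction pipelines often use more elaborate methods.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (spread_x * spread_y)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if it is not too correlated with one already kept."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


features = {
    "tenure_months": [1, 5, 12, 24, 36],
    "tenure_days": [30, 150, 360, 720, 1080],   # exact multiple of tenure_months
    "support_calls": [4, 0, 3, 1, 2],
}
kept = drop_correlated(features)
print(kept)  # ['tenure_months', 'support_calls']
```

The result depends on the order in which features are visited, so a platform hint about which of two correlated features to keep is still valuable.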

Ease of Use in building Models

The ability to build models for common problems easily is important to platform adoption and broader support. The platform should provide solution templates so that one can get started easily. If I am building a customer churn prediction model, it is not hard to build a workflow that guides the user in easy steps. Can past models built for the same use case guide the user in feature engineering?

A wide array of algorithms, such as GLM, GBM, Deep Learning and Random Forest, is now available in most platforms. Platforms supporting in-memory computation are able to build models faster and at a lower cost. This is important since newer use cases need to adapt to real-time conditions and need the ability to rebuild a model frequently (every hour, say). Start with simple algorithms such as GLM and GBM; they are easier to understand and tune. Whenever a data scientist on a team comes up with a proposal to solve a problem with complex algorithms, ask them to pause and see how to get started with a simple algorithm first and iterate. The iteration is more important than finding the exact algorithm.

Productizing the models

Once models are built, it is critical that they be enabled for production quickly. There is no better test than running in production with a small percentage of on-line traffic and getting some results. The quicker it can be done, the better. The platform should support experimental logging of scores. This way you can get scores from your model on production traffic without impacting the production application. This functionality is a much-needed capability for data scientists and will enable them to experiment with models quickly.
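Experimental logging of scores is often called shadow scoring: the production model answers the request, while a candidate model is scored on the same input and only logged. The Python sketch below shows the idea under simple assumptions; `score_with_shadow`, both toy models and the `late_payments` feature are hypothetical.

```python
def score_with_shadow(features, production_model, candidate_model, shadow_log):
    """Return the production score; record the candidate's score on the side."""
    production_score = production_model(features)
    entry = {"features": features, "production": production_score}
    try:
        entry["candidate"] = candidate_model(features)
    except Exception as exc:  # a failing candidate must never affect the caller
        entry["error"] = repr(exc)
    shadow_log.append(entry)
    return production_score


def production_model(features):
    return 0.25 * features["late_payments"]


def candidate_model(features):
    return 0.125 * features["late_payments"] + 0.5


def broken_candidate(features):
    return features["missing_field"]  # raises KeyError on purpose


log = []
request = {"late_payments": 2}
served = score_with_shadow(request, production_model, candidate_model, log)
served_again = score_with_shadow(request, production_model, broken_candidate, log)
print(served, log[0]["candidate"], "error" in log[1])  # 0.5 0.75 True
```

The caller only ever sees the production score, so a new model can be compared against live traffic before it is trusted with any of it.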

In the past, models were built, converted to code and pushed to production systems, a process that took weeks. The new generation of SaaS-based ML platforms has integrated model building and scoring into the platform, so that models can be enabled for scoring with the click of a button and can be scaled easily. PMML adds portability to the model, although a PMML model never runs as efficiently as one that is built and scored in the same, optimized platform. So, PMML gives flexibility but sometimes at the expense of optimization, a normal tradeoff that we encounter in other technology stacks also.

Quick iteration is the best way to know the efficacy of the model and make tuning adjustments.

Visibility & Collaboration

Data science is still a black box for many inside the company. What models have been built, what models are being used for A/B testing for a certain application, etc., are hard to get at. If you ask a few months later what training data was used to build a model, there is no easy answer. Many tech-savvy companies have built proprietary tools to manage this. Data scientists are now using a wide array of tools such as R, Python, H2O and Spark/MLlib, among many others. Integration with other platforms and tools is important in providing visibility to peers and fostering collaboration. Organizing how models are built across this wide array of tools, and sharing what is learned, should be part of it.

A platform that makes it easy to organize and tag models, allows collaboration, and keeps track of changes will help speed innovation. The more open it is, the better the chance of success.

Model Templates  

There will be complex problems for which one has to do lots of analysis, feature engineering and build sophisticated algorithms, but there are classes of problems for which simple algorithms and solutions will be good enough. The platform should be easy enough for product managers and business analysts to build models. They should also be able to leverage model configurations from their data scientists to play around with new data. It should be easy to compare scores of multiple models to see how the new model stacks up. Providing model templates for common sets of problems and verticals can help new users leverage the platform better.

Data drift: Adaptive to data change & Model Versioning

In most organizations, retraining of models is scheduled once every six weeks or once a quarter. These days, data is changing at a much faster rate, and it is important to leverage it sooner. So, the platform needs to surface important changes in data characteristics, both overall and at the feature level. These changes can point to data pipeline issues, which need to be addressed sooner since they impact model performance.
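One widely used way to quantify a feature-level change between training data and recent data is the Population Stability Index (PSI). The Python sketch below is an illustration, not a description of any specific platform: the bin edges, the sample values and the small floor used to avoid taking the logarithm of zero are all assumptions.

```python
from math import log


def psi(expected, actual, edges):
    """Population Stability Index of `actual` against `expected`.

    `edges` are sorted bin boundaries; values below the first edge fall in
    bin 0 and values at or above the last edge fall in the final bin.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        # Floor each share so that empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    expected_shares, actual_shares = shares(expected), shares(actual)
    return sum((a - e) * log(a / e) for e, a in zip(expected_shares, actual_shares))


edges = [25, 50, 75]
training = [10, 20, 30, 40, 50, 60, 70, 80]   # values seen when the model was built
recent = [60, 70, 80, 90, 55, 65, 85, 95]     # values arriving now, shifted upward

no_drift = psi(training, training, edges)
drift = psi(training, recent, edges)
print(no_drift, drift > 0.25)  # 0.0 True
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right alert level depends on the feature and the business, so it is best treated as a configurable threshold.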

It is also useful to have a tool that compares two models, showing their configuration and feature differences. Such a tool helps in analyzing how data is changing over time and what impact that has.

Note that tech-savvy companies will have a big team of data scientists who build custom tools; we are not talking about them. We are talking about the many companies that are not in the technology area and cannot afford a big data science team; they need tools that are simple to use and can help speed the adoption of machine learning into their business. Cloud-based SaaS platforms are the best way to get started at a lower cost.

At ScoreData, we have built ScoreFast™, a cloud-based platform that is geared for such businesses: simple to use, at lower cost. Once a model is built, it can be enabled for scoring with a single click. The model is optimized for speed. Models can be shared among peers so that they can see what features are being used, as well as leverage the configuration to build models using their own data. A data quality ScoreFast™ Scorecard can be configured for each feature and for the overall data set, with recommendations to the modeler.

The next generation of ML platforms will make machine learning more transparent, more collaborative and easier to use at a lower cost.


Predictive Analytics for Financial Services Industry and ScoreFast™

April 4th, 2016

History shows that the financial services industry has always been an early adopter of new-age technologies, and even more so in utilizing its data assets for business benefit, which is essentially what data analytics is. This is because financial services is, at its core, the business of making profits on the spread between the earnings on assets and the expenses on liabilities over a reasonably big customer portfolio.

So the industry by definition is data intensive and success depends a lot on an organization’s ability to understand its customers, their behaviors, and to leverage those insights in day-to-day operations. This is the reason why during the early computing era in the sixties and seventies, banks and financial services institutions were the first businesses to leverage their historical datasets for important business functions like credit decisioning.

Today, with the advent of IoT and big data, almost everything we do is being captured at an unprecedented rate, and cloud technologies both deliver new data sources and provide a scalable, pervasive ecosystem for analytics. In this environment, the same DNA of the financial services industry is fostering an era of unprecedented innovation in the use of data and machine learning technologies. On the one hand, traditional financial services firms are finding novel ways to leverage machine learning and big data to optimize standard business processes, while on the other hand new-age FinTech firms like Klarna, Notion and Affirm are using all this technological power to redefine the industry itself. One example would be using social media signals and Internet footprint in the credit profiling and decision process, making it more robust and at the same time reducing turnaround times.

Data is the bedrock on which machine learning and predictive analytics stand. So in order to look at how predictive analytics and new advancements in these technologies are changing the banking and financial services industry, let’s look at all the different types of data and signals that are available to these businesses.

At a broader level, there are two types of datasets that a company has access to: business data (the data a company gathers while conducting its business, such as customer demographics and transactional datasets) and outside data, which in turn can be either public data (social media etc.) or private datasets available for restricted usage (e.g. credit ratings). Companies use these datasets for all sorts of purposes, but essentially to understand their customer segments, their habits, behaviors and preferences, and to use these insights to inform their business decisions.

If we list the core business functions in the financial services industry, from a business standpoint as well as from the perspective of the variety of predictive use cases, we arrive at the following: Sales and Marketing, Risk Management (fraud risk, credit risk etc.), Customer Relationship Management, and Collections and Recoveries.

Having established the categories of datasets and important business functions, let’s now look at predictive use cases for various business functions one at a time:

Sales and Marketing

Marketing involves investing money in campaigns in order to lure new (or old) customers into the business. In order to allocate marketing budgets optimally, it is vitally important to understand the historical returns on investment from various marketing campaigns. One important predictive use case in this arena is Promotion Response Models, which use historical data to capture the interplay of promotions and the resulting responses (footfalls, click-through rates (CTRs), sales etc.). Essentially, these models help companies simulate potential sales (or other relevant metrics) for a specific allocation of promotional dollars and run different scenarios according to business strategy. Companies then use these simulations to come up with winning budget allocations for maximum ROI.

Another evolved area of predictive analytics application is Sales Forecasting Tools. Being able to use historical sales trends, market directions, macroeconomic data and other relevant signals to accurately foresee a future sales trend is of primary importance to any business, more so for large-scale financial services organizations. Channel Optimization is another very important predictive use case in sales and marketing functions. It entails devising an elaborate channel-wise budget allocation plan for maximum ROI.

To summarize, fundamentally the focus of predictive analytics in sales and marketing functions is on improving marketing efficiency and maximizing the ROI on sales campaigns.

Risk Management

Being able to accurately understand underlying risks (be it fraud risk or credit risk or other risks) and use this information efficiently for business benefit is at the core of success in the financial services industry. This is why one of the first use cases of predictive analytics in the industry was in the area of credit ratings. Today, with a lot of diverse datasets available, the industry is innovating every day in order to improve risk management functions.

In Credit Risk Assessment functions, especially at the time of onboarding, new age FinTech firms are using all sorts of signals – from customers’ social media footprint, to social network maps (friends/ colleagues/ family), along with more traditional data sources like demographic and profile information and credit history to improve the credit decision processes – making it more efficient as well as cutting down timelines. The decision cycles are getting shorter without compromising on decision qualities, and in fact in many cases improving them. The whole cloud based distributed computing ecosystem and mobile technologies have made high end computing resources available for innovation and opened up the marketplace. New age FinTech companies (e.g. LendingClub, Affirm, Klarna etc.) are leading efforts in these areas.

A lot of predictive analytics is also used in Credit Line Management: a dynamic assessment of the credit line to be extended to a customer based on her profile, past behavior and most recent transactional signals.

Another key area for predictive analytics applications in risk management functions is Fraud Risk Management. This comprises all fraud risk exposures for a bank, from the time of sourcing through all transactional fraud exposures during the lifetime of a customer relationship. Financial services companies use predictive analytics to predict the propensity for fraud at the customer level as well as at the transaction level, and use this information in their risk management decisions to establish acceptable risk criteria. Cloud-based distributed computing architecture is allowing companies to be very nimble with their fraud risk containment decisions. Companies are using the most recent signals and trends to inform their fraud alert systems, helping them strike a fine balance between customer experience and fraud risk exposure at all times.

Customer Relationship Management

From the time a customer comes on board until the relationship ends, all interactions with the customer can be labeled under customer relationship management. Predictive analytics plays an instrumental role in various CRM functions. One such high-impact area is Cross-sell/Up-sell. Selling to an existing customer makes more business sense than acquiring a new customer. If you do it right, you not only deepen your existing customer relationships but also invest your marketing dollars in the most efficient place. On the flip side, if you don’t do it right, cross-selling to uninterested customers can result in irate customers and erode brand equity. Financial services companies use customer behavior data along with demographic information and other signals to accurately predict customers’ interest in other products. Today’s machine learning tools can ingest even the most obscure signals to predict the propensity of customers to react positively to cross-sell and up-sell offers.

Another important area of predictive focus is Customer Churn. Acquiring a customer can involve big investments and churn takes away the opportunity of a business to make good on a customer relationship. Being able to successfully predict customers’ propensities to churn in a given period gives businesses enough time to run preventive campaigns and contain customer churn.


Collections and Recoveries

Collections is one of the core functions in the financial services business. A company’s ability to collect efficiently on its debts in today’s market depends a great deal on its ability to use historical data efficiently. This enables it to preempt possible default events, predict payment propensities, and so on, which in turn helps it allocate collections budgets optimally. Some established predictive use cases in the collections function are various Delinquency Prediction Scorecards and Payment Propensity Prediction Scorecards (for recoveries portfolios).

With such a complex array of functions to perform across the spectrum of customer engagement, speed of execution, speed of anticipation and speed of delivery of offers to consumers are essential, especially for the Banking and Financial Services industry. ScoreData, with its ScoreFast™ engine, makes it possible for financial services companies of all sizes to make their decisions in real time or near real time across a broad spectrum of applications in Sales and Marketing, Customer Churn, Risk Management, and Customer Relationship Management.

The most important need for any consumer-facing industry such as Banking and Financial Services is customer engagement. In the three years since ScoreData was founded, they have focused on building solutions for consumer facing industries. In order to further assist banks to improve the responsiveness and effectiveness of their sales and marketing campaigns, and to implement cross-sell strategies to assess customer loyalty, the analytics platform has a variety of pre-built model offerings in consumer analytics, risk analytics and other areas like churn management etc.

The ScoreFast™ platform fosters widespread analytics consumption and insights usage across organizations and has an easy to use business dashboard driven data/model development and deployment facility.   Comprehensive centralized model management with version control means less duplication, more collaboration, and ease of diagnosis when model performance deteriorates.

At run time, models update themselves, incorporating a wide variety of company-internal, third-party and regulatory data. The platform is flexible enough to ingest new data sources or tune out old data sources during the model building process. This ensures that the most accurate models get deployed over time. ScoreFast™ then compares and contrasts results from hundreds of models built in memory. This is a significant improvement over legacy practices, shrinking model-to-market times from weeks or months to days or hours. ScoreFast™ is an ideal platform for new ecosystems in the Banking, Financial Services, and Insurance industries.

Mudit Chandra
April 03, 2016


ScoreFast™: fostering a new era in customer engagement for the Insurance Industry

March 17th, 2016

Insurance companies, by the very nature of their business, have constantly growing data sets that need to be analyzed continuously. There is also a growing demand for widespread creation and consumption of up-to-date analytics-driven insights within the enterprise. However, these enterprises are constrained by aging tools, solutions, pricing, and deployment models. Furthermore, there is a growing shortage of insurance-domain-savvy practitioners of Big Data Analytics.

New generations of analytics platforms, technologies, and frameworks now make rapid and automatic development of models, and in-memory computation, possible. These frameworks also allow the absorption of legacy data models into the new platforms. ScoreData Corporation has built the ScoreFast™ platform, which combines best-of-breed algorithms and state-of-the-art open-source frameworks to deliver models and run-time environments with dramatically short turn-around times.

At the heart of ScoreData’s model development platform are real-time self-learning models.  These ensure that you do not have to hire large teams to build and maintain your models.  This ensures the lowest total cost of ownership of any modeling platform in the industry.

“The world is moving from ‘proprietary’ tools for data scientists to Open Source tools and solutions for business users. And this is precisely what makes ScoreData a company to partner with for the future,” says Vasudev Bhandarkar, CEO of ScoreData. “Our tools are designed from ground up for cloud and customized for insurance industry-specific solutions in Claims, Risk, and Insurance product recommendations.”

The most important need for any consumer-facing Industry is customer engagement.  In the three years since ScoreData was founded, they have focused on building solutions for consumer facing industries.  In order to further assist insurance firms to improve the responsiveness and effectiveness of their sales and marketing campaigns, and to implement cross-sell strategies to assess customer loyalty, the analytics platform has a variety of pre-built model offerings in consumer analytics, risk analytics and claim analytics.

The ScoreFast platform fosters widespread analytics consumption and insights usage across organizations and has an easy to use business dashboard driven data/model development and deployment facility. “Comprehensive centralized model management with version control means less duplication, more collaboration, and ease of diagnosis when model performance deteriorates,” explains Bhandarkar.

At run time, models update themselves, incorporating a wide variety of company-internal, third-party and regulatory data. The platform is flexible enough to ingest new data sources or tune out old data sources during the model building process. This ensures that the most accurate models get deployed over time. ScoreFast then compares and contrasts results from hundreds of models built in memory. This is a significant improvement over legacy practices, shrinking model-to-market times from weeks or months to days or hours.

Citing an example, Bhandarkar explains how ScoreData helped one of its clients, which holds a diversified financial portfolio, devise a much improved cross-sell strategy that resulted in significant yield improvement. Through ScoreData’s expertise, the client was able to increase its cross-sell penetration by 200 percent within two quarters of implementing ScoreData’s models on its portfolio of over half a million customers.

ScoreData was founded with a zeal for providing innovative and custom data analytics solutions for the ever-changing needs of today’s businesses. The company works closely with its clients and their business managers to deliver predictive insights for their business. “Newer product releases of the ScoreFast engine will feature special packages for the insurance industry. We expect this will result in quick turn-around times in claims fulfillment, and optimal package offers for insurance consumers,” concludes Bhandarkar.

ScoreFast™: A scoring engine to build custom models, with a service to deploy either on-premise or cloud architectures.

Quote: “The ScoreFast engine offers significant improvement in model-to-market times for scoring, thus shortening the processing times for bringing insurance solutions to market.  Furthermore, ScoreData teams assisted by ScoreFast™ are standing by to help you build the most effective Claim and Risk solutions for the market”

Location: Palo Alto, CA


Data Scientists for the 21st Century – Are we in for a drought?

March 14th, 2016


Recently, a TechCrunch article, citing a McKinsey study, noted that by 2018 the number of data science jobs in the US alone would exceed 490,000, whereas there would be fewer than 200,000 data scientists to fill those positions. Globally, the demand for data scientists is projected to exceed supply by more than 50% by 2018. Yet, one has to be impressed by the number of courses and degrees offered by US universities in data science. For example, the following lists programs offered by California universities,

Other states have also been equally proactive; please see the above link for more information.  Add to that, online courses offered by Coursera, Udacity, edX, etc., and the practicing engineer has a wide range of choices for specializing in the data science field.  So, is it going to be gloom and doom when it comes to filling data science positions of the future?  I, for one, do not think so.

On February 11, I attended UC Berkeley’s EECS annual BEARS meeting. Berkeley has been at the forefront of technology development for data science with significant contributions, such as the Berkeley Data Analytics Stack (BDAS), the main component of which is the Spark (now Apache Spark) software. Berkeley has been equally innovative in developing courses for budding data scientists. Prof. David Culler talked about the “Foundations of Data Science” class, which is taken by about 500 undergraduate students; please see the attached picture of the class in session. 80% of those students noted that the class was outside their major field of study. This means that data science is becoming a fundamental tool, much like mathematics. And if Berkeley alone trains that many students with basic data science skills, how could we possibly be headed for a shortage?

To address the question, one has to examine what the new field of data science encompasses. At the infrastructure layer, it includes data storage and retrieval systems, such as the HDFS file system, NoSQL databases, etc. Above that layer come tools for parsing, auditing, and cleansing the data. Then come machine learning models to extract insights out of the data, and finally visualization tools for humans to extract and consume further insights from the output of the models. Do we really need the all-in-one data scientist with all the above skills across the stack? If one looks at the software engineering field, one might be tempted to argue that the ideal software engineer should be adept at developing software across the system and application stacks.


Figure 1: Typical Data Consumption Infrastructure Stack


Yet, in reality, software engineers specialize in specific areas, such as storage systems, databases and middleware, networking software, application software, etc.  The same is going to be true for data science as well.  Teams of software engineers, statisticians, and data analysts are likely to fill the needs of the data science stack effectively.  Some understanding of the whole stack would certainly be useful, but having specialists focus on their areas of expertise would be the best approach.  This is already happening in the industry.  Moreover, a number of open source data science and machine learning toolsets, such as Apache Spark, H2O, Google TensorFlow, etc., have made it easier to build complex predictive analytics models.  So, I conclude that engineers and data analysts working together in teams will indeed fill the needs of the industry when it comes to data science for 2018 and well beyond.  Comments and feedback from readers are most welcome.

Dr. K. Moidin Mohiuddin
March 11, 2016



Predictive Analytics

October 24th, 2013

Usually, predictive analytics provides a tool or predictive model that outputs a score (the propensity of an event) that can be used for targeted campaigns across processes such as collection, recovery, customer acquisition, cross-sell/up-sell, customer retention, and churn prevention.


Predictive analytics is a proven technology that industry has used successfully for decades. It helps businesses make smarter decisions, and it increases the precision, consistency, and speed of decisions about customers and prospects.

Some examples of the use of predictive analytics in industry are as follows:

Telecom:  A large US-based telecom company used our predictive model to identify customers who were likely to leave their dues unpaid and eventually be terminated within the next six months. They used the scores to design their customer outreach program, for example by sending emails to self-cure customers.

Banking: A leading Indian bank used predictive analytics to cross-sell its LAP (loan against property) product to its current account and savings account portfolio.

Retail:  A London-based retailer used predictive analytics to forecast its weekly sales by product and used that information for inventory management.

Insurance:  We helped an insurance company identify customers who were likely to churn or unlikely to pay premiums after the minimum lock-in period. We used regression analysis (a logistic model) to solve this problem.
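
To make the idea of a logistic model that outputs a propensity score more concrete, here is a minimal sketch in Python. It is not the model used in any of the projects above: the two features (missed payments and a scaled tenure value), the toy data, and the function names are all hypothetical and chosen only for illustration. The model is fitted by plain gradient descent, and customers are then ranked by their predicted churn score so that a campaign could target the highest-risk ones first.

```python
import math


def sigmoid(z):
    """Map a real-valued input to a probability-like score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_score(weights, x):
    """Return the propensity score for one customer (weights[0] is the bias)."""
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return sigmoid(z)


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression weights by batch gradient descent on log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict_score(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Hypothetical toy data: (missed payments in last six months, scaled tenure).
rows = [(0, 0.9), (0, 0.7), (1, 0.8), (1, 0.5),
        (2, 0.3), (3, 0.2), (3, 0.4), (4, 0.1)]
# 1 = customer churned, 0 = customer stayed (made-up labels).
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights = fit_logistic(rows, labels)

# Rank customers from highest to lowest churn propensity for targeting.
ranked = sorted(rows, key=lambda x: predict_score(weights, x), reverse=True)
for customer in ranked:
    print(customer, round(predict_score(weights, customer), 3))
```

In practice one would use an established statistics or machine learning library and far more data; the point of the sketch is only that the model's output is a score between 0 and 1 that can be sorted and thresholded when designing a campaign.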

Predictive analytics helps to improve every aspect of business decision-making. It makes business decisions agile, consistent, precise, and very cost-effective. Some of the major benefits that applying predictive analytics brings to your business are as follows:

  • Know your customers
  • Development of a decision system that is scientific and backed by data
  • Reduction of cost and increase in profit
  • Consistency and competitiveness in business decisions
  • Ability to make complex decisions at run time

Broadly, the analytical techniques used in predictive analytics can be divided into regression techniques and machine learning. Regression techniques include linear regression, logistic regression, discrete choice models, probit regression, time series models, survival or duration analysis, etc. Machine learning includes neural networks, multilayer perceptrons (MLP), radial basis functions, naïve Bayes, etc.

ScoreData Corporation is a pure-play data analytics company focusing entirely on predictive analytics. We use advanced analytical techniques to solve complex business problems. We take pride in providing analytically rigorous solutions in a very cost-effective manner.
