Data Mining


Riju George

Program Manager
Hewlett Packard

Gold Rush: A Machine Learning Solution To Predict Sales Opportunity Conversion

The hi-tech industry operates in a highly dynamic business environment and is strongly deal driven, marketing complex configurations of infrastructure and services to a variety of industry verticals. Moreover, this industry’s multi-tiered global supplier network needs significant fulfillment time for most components, in some cases as long as 6 months. Hence the supply chain function requires early and accurate visibility of deals to plan and position material. Early predictability of sales pipeline conversion is also crucial to the firm’s sales function in order to focus sales force effort on the right deals and effectively manage revenue targets. Gold Rush, a predictive solution developed at Hewlett Packard Enterprise using machine learning techniques, addresses the above challenges faced by the Supply Chain and Sales functions. We will outline: (1) the approach, the machine learning algorithm employed and the solution architecture; (2) application to business use cases; and (3) results and expected benefits.


Riju Thomas George obtained his bachelor of technology degree from the Indian Institute of Technology Madras, India and is currently working as a Program Manager in the Enterprise Group Analytics division at Hewlett Packard Enterprise. He is a supply chain professional with several years of experience in demand planning & forecasting, supply planning, inventory management and optimization. He has worked as a Consultant for Fortune 500 companies cutting across several industry verticals. He has also worked in core Supply Chain Operations during his tenure at HP.


Genetha Gray

Data Scientist
Intel Corporation

Improving Human Resources Procedures And Decision Making Using Analytics

There are a number of impressive examples of women and underrepresented minorities who have made significant contributions to the STEM (science, technology, engineering, & math) fields. Despite their successes, these groups remain significantly underrepresented in the tech workforce. In fact, in the summer of 2014, many of the large companies in Silicon Valley revealed just how low the representation of women is in tech positions (10-20%) despite market availability. To combat this trend and meet business needs, in January 2015 at the Consumer Electronics Show, Intel’s CEO Brian Krzanich announced a new commitment to promote positive representation of women and minorities as well as the goal of reaching parity by 2020. In this presentation, we describe how this announcement prompted a necessary culture shift in the human resources department itself and a new dependence on analytics to help Intel achieve its goals. We will describe how the problem of fundamentally changing the make-up of the workforce demands new ways of thinking about data fusion, data mining, data analysis, and data visualization techniques as well as new approaches to describing uncertainties in forecasting. The three main human resources issues that will be addressed in this talk are attraction and hiring, promotion, and retention. Each has its own unique concerns and corresponding relevant data sources. Therefore, each must be addressed with a tailored analytics approach. Some data describe the environment outside the institution of interest, such as census statistics, college enrollment and graduation rates, job postings, earnings reports, and miscellaneous news stories. Others may describe the particular institution and how it has changed over time. These include employee headcounts, business groups and goals, management structure, employee expertise, career progression, hiring and separation, and employee survey results.
Note that many of these data have both qualitative and quantitative attributes and may not be informative to the problem of interest when considered individually. We will thus address the utility of specialized data mining and data fusion techniques in this application space. Moreover, we will consider the challenges associated with the low representation of some of the employee subgroups of interest, and the resulting need to present data in ways that clearly illustrate the situation while protecting the privacy of the individual. Finally, we will examine traditional reporting values such as percentages and averages and show that they may not be useful in many human resources decision-making environments. We will show why metrics for tracking progress must be well thought out and applicable at different scales. Using analytics to drive a change in the workforce gives a statistical basis and robust credibility to a problem that at first glance may seem to be described only by tribal knowledge. Moreover, in a tech workplace such as Intel, many of the decision makers also have a STEM background and were themselves trained to use facts and figures to assess situations. For example, in the laboratory, an engineer analyzes system behavior by collecting and analyzing data. Similarly, the workforce can itself be viewed as a system with numerous sources of credible data that can be used to describe it and drive decision making. This talk will address the issues of transformation to an analytics-based environment and showcase the opportunities and challenges of applying the tools and methods from the rapidly developing field of data analytics to the field of human resources. Specifically, it will focus on how data can be used to increase credibility and visibility when describing the current workforce and its environment, setting meaningful and attainable goals, tracking and measuring progress, and forecasting the future workforce.


Dr. Genetha Anne Gray is a data scientist in the Talent Intelligence Analytics Organization at Intel Corporation where she analyzes talent supply chain management, career progression, and representation of women and underrepresented minorities. Before joining Intel in 2014, Genetha was a member of the technical staff at Sandia National Labs where she researched problems in the areas of systems engineering, the environment, security, and energy. She has a Ph.D. in Computational & Applied Mathematics from Rice University and specializes in optimization under uncertainty, data fusion, model validation, and uncertainty quantification. She has written over 20 peer reviewed papers, co-authored a textbook about environmental modeling due out in 2016, and given over 50 presentations.


Alexander Sasha Gutfraind

Chief Healthcare Data Scientist
Uptake Technologies

Nowhere To Hide – Analytics In Graphical Databases With Application To Covert Network Forensics

Whereas in the past most business data was stored in relational databases, recent years saw the emergence of a number of innovative database technologies, which are optimized for different applications. After reviewing the new technologies, we will describe Graphical Databases (GraphDBs) – a storage system designed to store entities and the relationships between them.
As a showcase of the GraphDB technology, we will focus on the problem of covert network forensics – the problem of reconstructing the relationships in a covert (crime or terrorist) network. We applied a GraphDB to map and reconstruct the covert network used to carry out the Nov. 13, 2015 attacks in Paris. We identified differences between Al-Qaida and the IS network, finding that the IS network uses smaller cells but may be relatively vulnerable to disruption by counter-terrorism authorities. Beyond covert networks, GraphDBs can greatly accelerate knowledge creation and mining for many applications, as compared to relational databases and SQL.
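To make the idea of storing entities and their relationships concrete, here is a minimal in-memory sketch in Python. It is not the GraphDB product used in the talk; the `TinyGraph` class, the node names (`P1` to `P6`) and the relationship labels are all hypothetical placeholders. It shows the kind of "who is within k hops of this node" query that a graph store answers by following edges, where a relational schema would typically need repeated self-joins.

```python
from collections import defaultdict, deque


class TinyGraph:
    """Toy property graph: nodes plus typed, undirected relationships."""

    def __init__(self):
        # node -> set of (neighbor, relation) pairs
        self.adj = defaultdict(set)

    def relate(self, a, b, relation):
        """Record an undirected relationship between two entities."""
        self.adj[a].add((b, relation))
        self.adj[b].add((a, relation))

    def within(self, start, max_hops):
        """Return {node: hop distance} for nodes at most max_hops from start."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if seen[node] == max_hops:
                continue  # do not expand past the hop limit
            for neighbor, _relation in self.adj[node]:
                if neighbor not in seen:
                    seen[neighbor] = seen[node] + 1
                    queue.append(neighbor)
        return {n: d for n, d in seen.items() if n != start}


# Hypothetical placeholder data, not drawn from any real network.
g = TinyGraph()
g.relate("P1", "P2", "called")
g.relate("P2", "P3", "met")
g.relate("P3", "P4", "financed")
g.relate("P5", "P6", "called")

print(g.within("P1", 2))
```

In this toy data, `P4` only becomes reachable from `P1` once the hop limit is raised to 3, and `P5`/`P6` form a separate component that is never reached.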


A. Sasha Gutfraind is a Data Scientist at Uptake Technologies and on the Faculty of the University of Illinois at Chicago. He earned a Bachelor’s and a Master’s from the University of Waterloo in Applied Mathematics and a Ph.D. from Cornell University. He has 10 years of experience in applying analytics to counter-terrorism and network analysis problems. At Uptake he develops predictive analytics for disruptive business applications and researches problems in network analysis, security and public health.


Randy Holl

Contact Solutions

Methods And Apparatus For Predictive Modelling And Analysis Of Textual Network Interactions

We consider a predictive modelling approach to interactive dialogs from digital customer service engagements to analyze topic occurrence trends over time. Contact Solutions’ MyTime service provides a customer service engagement platform through which customers can interact with support agents in an instant-message-like environment. Topic modelling in this environment is challenged by high overlap in topic vocabularies, narrow topic separation, noisy input (misspellings, abbreviations, ‘chat speak’), and small dialogue sample lengths. We extract the natural language text from MyTime dialogues and perform term-level analysis to regularize text structure and reduce redundant textual noise. A corpus is formed from the cleaned dialog texts as a collection of M documents, where each document is a series of N words from a regularized conversation instance. Term-document matrices (TDM) and term frequency-inverse document frequency (tf-idf) schemes are employed to analyze the frequency of words within the corpus. We consider a variety of methods for identifying the “most” relevant words, including mapping word occurrence to word-probability space and extracting probabilistic outliers. Multiple topic modeling techniques to retrieve interpretable topics and label dialogues are presented and compared. The base topic model used in the comparison is Latent Dirichlet Allocation (LDA), a three-level hierarchical Bayesian generative model in which each document is modeled as a finite mixture over an underlying set of topics defined a priori. The model learns the posterior distribution of topic probabilities from a training/build data set and is then used to predict topics for new dialogue instances. Additional topic models are considered, including the Correlated Topic Model (CTM), which accounts for correlation between topics, and Dynamic Topic Models (DTM), which can analyze the evolution of topics over time.
We report novel evaluation techniques that use cross evaluation and scoring systems to rank relative topic importance. We illustrate the results of the topic model through significant and tangible events that impact the topic distribution for specific clients in various industries. In addition, we suggest visuals and interactive features to demonstrate the output of topic modeling.
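The tf-idf weighting mentioned in this abstract can be sketched in a few lines of plain Python. This is only an illustration of the standard formula (term frequency within a document times the log of the inverse document frequency), not the MyTime pipeline; the `tf_idf` helper and the three sample dialogs are made up, and the sketch assumes every document is non-empty and already tokenized.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = relative frequency in this document; idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-cleaned dialog snippets.
dialogs = [
    "my card was declined".split(),
    "my card balance please".split(),
    "reset my password please".split(),
]

weights = tf_idf(dialogs)
for doc_weights in weights:
    ranked = sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)
    print(ranked)
```

A term such as "my", which occurs in every dialog, receives weight zero, while terms confined to a single dialog rank highest. Weights of this kind are one way to pick out candidate "relevant" words before a topic model is fitted.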


Randy Holl has led the development of high technology systems for over 25 years. His career includes an impressive record with noteworthy technology-driven companies across a wide span of technology sectors. He has received numerous commendations, including the Chairman’s Award for excellence in innovation at Harnischfeger Industries. Prior to joining Contact Solutions, Holl served in the executive-level business and information technology positions of Chief Operating Officer, Chief Technology Officer, VP of Development, and Director of Software Architecture for major software product and service companies including FIS, Hyperion, Ernst & Young, IBM, and Harnischfeger Industries. Mr. Holl also served as a qualified submarine officer in the U.S. Navy.


Anna Olecka

Manager, National Analytics Advisory Practice

Multi-stage Clustering Deepens Understanding Of Consumer Banking Behavior In A Rising Rate Environment

In December 2015 the Federal Reserve raised interest rates for the first time in a decade and signaled future raises. As interest rates rise, banks will want to re-price the deposit products they offer to consumers. Banks will look to predictive models to help in customer-level pricing decisions. But a lot has changed in the decade: consumers have become savvy and selective, competitive information is widely available via the internet, and nimble direct banks with low overhead have entered the scene, narrowing profitability margins, to name just a few challenges. Due to these changes, historical data from a decade ago will not be applicable in today’s environment. A new approach to understanding consumers’ needs and wants is required to optimize offerings in the rising rate environment.
Behavioral economics and data obtained through market research can help banks understand which consumers can be attracted and retained with targeted banking products and strategic pricing and which customers are likely to flee in search of better rates.
However, building an in-depth understanding of consumer behavior based on market research data can be challenging. If some of the behavioral attributes dominate (i.e. show stronger discrimination than others), the less dominant but essential attributes can get lost in the shuffle.
This is the case with choice models for banking preferences. Rate sensitivity dominates other features such as bank type, products or channel preferences, so – in a standard unsupervised clustering process – the rate related dimensions discriminate nicely among consumers, but the other, non-rate preferences come out flat.
In this talk we show how a multistage clustering approach can overcome the feature domination challenge and tease out the dominated features, such as product and service channel preferences.
This work is based on data from PwC’s Dynamic Pricing Survey of 4K consumers conducted in 2015.
- We review consumer choice data collected from the survey.
- We show how a multistage approach creates separation in all desired dimensions and profile the resulting segments.
- We benchmark our results against a standard unsupervised clustering approach that separates on rates only.
- We also discuss how banks can use this technique to optimize the value proposition for individual customers and prospects, thus remaining competitive in the rising rate environment.

While the examples are drawn from the financial services, this segmentation approach can apply to any supervised or unsupervised learning where some classes of characteristics are more dominant than others.
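The abstract does not specify the clustering algorithm, so the following is only one plausible reading of a multistage approach, sketched with plain k-means: stage one clusters on the dominant (rate) dimension, and stage two re-clusters each stage-one group on the dominated dimension. The `kmeans` and `multistage` helpers, the column layout (rate sensitivity, channel preference) and the eight data rows are all hypothetical.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic init; returns one label per point."""
    ordered = sorted(points)
    # Seed centroids with k points spread evenly over the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def multistage(rows, dominant, secondary, k1=2, k2=2):
    """Stage 1 clusters on dominant columns; stage 2 re-clusters each group on secondary columns."""
    stage1 = kmeans([tuple(r[i] for i in dominant) for r in rows], k1)
    segments = [None] * len(rows)
    for c in range(k1):
        idx = [i for i, lab in enumerate(stage1) if lab == c]
        if not idx:
            continue
        sub = kmeans([tuple(rows[i][j] for j in secondary) for i in idx], min(k2, len(idx)))
        for i, lab in zip(idx, sub):
            segments[i] = (c, lab)
    return segments


# Hypothetical respondents: (rate sensitivity, channel preference).
rows = [
    (0.0, 0.1), (0.5, 0.9), (1.0, 0.2), (0.2, 0.8),
    (10.0, 0.1), (10.5, 0.9), (9.8, 0.15), (10.2, 0.85),
]

flat = kmeans(rows, 2)                                   # single-stage benchmark
segments = multistage(rows, dominant=[0], secondary=[1])
print(flat)
print(segments)
```

On this toy data the single-stage benchmark splits respondents on rate sensitivity alone, while the two-stage labels also separate low and high channel preference within each rate group. Rescaling features or running a single clustering with more clusters are alternatives not explored here.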


Anna Olecka is an analytics expert with over 19 years of experience and a proven track record of improving business results through data analytics and technical innovation in data mining (patent holder), optimization and customer analytics. She has successfully led complex analytic engagements from initial scoping to implementation for a number of financial institutions, as well as a leading commercial data provider and a large telecommunications company.