Saed Sayad, M.D., Ph.D., University of Toronto, Canada
Title: Scalable Data Mining
Abstract: Data mining is about explaining the past and predicting the future by exploring and analyzing data. Data mining is a multi-disciplinary field which combines statistics, machine learning, artificial intelligence and database technology.
Although data mining methods are widely used in extremely diverse situations, in practice one or more major limitations almost invariably appear and significantly constrain successful applications. Frequently, these problems are associated with large increases in the quantity of data to be processed, and the term “scalable” is used to describe how well a data mining method can accommodate an increased data load. However, such scalability problems are usually closely coupled with the fact that conventional data mining methods operate in batch mode, which requires all of the relevant data to be available at once. Thus, here we define a scalable data mining technique as one having ALL (not just a select few) of the following characteristics, independent of the amount of data involved:
- Incremental learning: immediately utilizing new data without the necessity of pooling new data with old data and returning to the model formulation step;
- Decremental learning: immediately removing data identified as adversely affecting model performance without forming a new dataset omitting this data and returning to the model formulation step;
- Variable addition: immediately utilizing values for a new, not previously considered, variable without forming a new dataset containing values of that new variable and then returning to the model formulation step;
- Variable deletion: immediately discontinuing use of a variable identified as adversely affecting model performance without forming a new dataset omitting values of that “bad” variable and then returning to the model formulation step;
- Scenario testing: rapid formulation and testing of multiple and diverse models to optimize data fitting;
- Real-time operation: processing so rapid that streaming data can be accommodated;
- In-line operation: processing that can be carried out in situ (e.g., in a mobile device or in a satellite);
- Distributed processing: separately processing distributed datasets or segments of a large dataset (that may be located in diverse geographic locations) and re-combining the results to obtain a single model;
- Parallel processing: carrying out processing extremely rapidly in parallel on multiple conventional processing units (multiple threads, multiple processors or a specialized chip).
Conventional data mining is upgraded to scalable data mining through the use of a method termed the Scalable Linear Machine, or SLM. Using the SLM with conventional data mining methods enables “scalable data mining”.
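Several of the characteristics listed above (incremental learning, decremental learning and distributed processing) can be illustrated with a small accumulator of running sums. The following is a minimal sketch, not the SLM itself: the class `RunningStats` and its method names are hypothetical, and it assumes that the quantities of interest (here a mean and a population variance) can be derived from a count, a sum and a sum of squares.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running sums for one numeric variable (illustrative, hypothetical class)."""
    n: int = 0              # number of observations currently included
    total: float = 0.0      # sum of x
    total_sq: float = 0.0   # sum of x squared

    def add(self, x: float) -> None:
        """Incremental learning: fold in one new observation."""
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def remove(self, x: float) -> None:
        """Decremental learning: take back a previously added observation."""
        self.n -= 1
        self.total -= x
        self.total_sq -= x * x

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Distributed processing: combine the sums of two data segments."""
        return RunningStats(self.n + other.n,
                            self.total + other.total,
                            self.total_sq + other.total_sq)

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def variance(self) -> float:
        """Population variance derived from the stored sums only."""
        return self.total_sq / self.n - self.mean ** 2


# Two segments of data processed separately (e.g. at two sites).
site_a, site_b = RunningStats(), RunningStats()
for x in [2.0, 4.0, 6.0]:
    site_a.add(x)
for x in [8.0, 100.0]:
    site_b.add(x)

site_b.remove(100.0)             # drop an observation judged to be harmful
combined = site_a.merge(site_b)  # single summary without pooling raw data
print(combined.n, combined.mean, combined.variance)  # prints: 4 5.0 5.0
```

None of the three operations revisits the raw data once it has been folded into the sums. One caveat of this simple form is that subtracting large sums of squares can lose floating-point precision, so a production implementation would likely need a more numerically careful update.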
Our motivation in authoring this tutorial is to help anyone interested understand the method and implement it for their own application. The tutorial provides previously published [1-5, 7] and unpublished details on the implementation of scalable data mining, along with a free software program that can be used for non-commercial purposes to implement it.
We begin by showing equations that enable scalable data exploration prior to the development of useful models. These “scalable equations” appear similar to the usual ones seen in many textbooks. However, closer examination reveals a notation slightly different from the conventional one. This notation is necessary to explain how what we term “scalable equations” differ from conventional ones. We then detail how a “Basic Elements Table” is constructed from a dataset and used to achieve scalability in a data mining method. Then, each of the following methods is examined in turn and the scalable equations necessary for utilization of the Basic Elements Table are provided: naïve Bayesian classifier, linear discriminant analysis, linear support vector machines, multiple linear regression, principal component regression, linear support vector regression, Markov chains and hidden Markov models. Finally, the supplied tutorial software is described.
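The abstract does not spell out the layout of the Basic Elements Table, so the following is only a hedged sketch of the general idea. It assumes the table holds the record count, per-variable sums and pairwise sums of products, and that a linear model can then be fitted from those entries alone. The class `PairwiseTable`, its methods and the variable names are hypothetical and are not taken from the tutorial software.

```python
from itertools import combinations_with_replacement


def _key(a, b):
    """Order-independent key for a pair of variable names."""
    return (a, b) if a <= b else (b, a)


class PairwiseTable:
    """Hypothetical stand-in for a table of basic elements.

    Assumed contents: row count, sum of each variable, and the sum of
    products x_i * x_j for every pair of variables (including i == j).
    """

    def __init__(self, names):
        self.names = list(names)
        self.n = 0
        self.sums = {name: 0.0 for name in self.names}
        self.cross = {_key(a, b): 0.0
                      for a, b in combinations_with_replacement(self.names, 2)}

    def add_row(self, row):
        """Fold one record (dict of name -> value) into the table."""
        self.n += 1
        for name in self.names:
            self.sums[name] += row[name]
        for a, b in combinations_with_replacement(self.names, 2):
            self.cross[_key(a, b)] += row[a] * row[b]

    def covariance(self, a, b):
        """Population covariance of a and b from the table entries."""
        mean_a = self.sums[a] / self.n
        mean_b = self.sums[b] / self.n
        return self.cross[_key(a, b)] / self.n - mean_a * mean_b

    def simple_regression(self, x, y):
        """Least-squares slope and intercept of y on x, using sums only."""
        slope = self.covariance(x, y) / self.covariance(x, x)
        intercept = self.sums[y] / self.n - slope * self.sums[x] / self.n
        return slope, intercept


table = PairwiseTable(["x", "y", "z"])
for x, z in [(1.0, 5.0), (2.0, 3.0), (3.0, 8.0), (4.0, 1.0)]:
    table.add_row({"x": x, "y": 2.0 * x + 1.0, "z": z})

# Scenario testing: try different predictors without re-reading the data.
print(table.simple_regression("x", "y"))  # prints: (2.0, 1.0)
print(table.simple_regression("z", "y"))
```

Because every model here is computed from the table rather than from the rows, trying a different predictor or target costs only a few arithmetic operations, which is the property the scenario-testing and variable-deletion characteristics rely on.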
SELECTED PAPERS RELATED TO THE TUTORIAL/TALK:
1. Shuo Yan, Saed Sayad and Stephen T. Balke, “Image Quality in Image Classification: Adaptive Image Quality Modification with Adaptive Classification”, Computers & Chemical Engineering 33 (2009) 429–435.
2. Keivan Torabi, Saed Sayad and Stephen T. Balke, “On-line adaptive Bayesian classification for in-line particle image monitoring in polymer film manufacturing”, Computers & Chemical Engineering, Volume 30, Issue 1, 15 November 2005, Pages 18-27.
3. K. Torabi, S. Sayad and S.T. Balke, “Adaptive Bayesian classification for real-time image analysis in real-time particle monitoring for polymer film manufacturing”, Fifth International Conference on Data Mining, Text Mining and their Business Applications, Malaga, Spain, 2004. WIT Press.
4. Saed Sayad, Stephen T. Balke and Sina Sayad, “An Intelligent Learning Machine”, 4th International Conference on Data Mining, Rio de Janeiro, Brazil, 1-3 December 2003.
5. Keivan Torabi, Saed Sayad and Stephen T. Balke, “Data Mining for Image Analysis: In-Line Particle Monitoring in Polymer Extrusion”, 3rd International Conference on Data Mining, Bologna, Italy, 25-27 September 2002.
6. Glenn Fung and O. L. Mangasarian, “Incremental Support Vector Machine Classification”, Proceedings of the Second SIAM International Conference on Data Mining, Arlington, Virginia, April 11-13, 2002, R. Grossman, H. Mannila and R. Motwani, editors, SIAM, Philadelphia, 2002, 247-260.
7. S. Sayad, M.H. Sayad and J. Sayad, “Neural Network with Variable Excitability of Input Units (NN-VEIN)”, Proc. 22nd Annual Pittsburgh Conference on Modeling and Simulation, Pittsburgh, 1991.
Bio Sketch: Dr. Sayad is an Adjunct Professor at the University of Toronto. He and Professor Stephen Balke established the first data mining research group at the University of Toronto and teach a popular graduate course in data mining. Dr. Sayad has more than eighteen years of experience in data mining and statistics. Since 1990 he has been involved in many business and scientific applications of data mining and has published many research papers.