Course general information
Quarter: Fall 2009
Lecture time: Thursday, 1:30-4:45pm
Lecture location: Loop campus
Course syllabus - contains information on the course objectives, textbook and topics, assignments and final project, grading, and class policies.
What is data mining? Generally, data mining (sometimes called data or knowledge discovery) is the process of extracting meaningful, non-trivial, and hidden information from large collections of data. When the information is validated by domain experts, the information becomes knowledge. The extraction of nuggets of information from mountains of data involves techniques from several fields such as statistics, machine learning, artificial intelligence, and visualization.
Where and how can data mining be used? Given the tremendous amount of data generated by advances in computer technology, data mining is needed to extract information (patterns, trends, correlations, etc.) in many domains, such as:
1. Marketing: "Data mining in marketing falls into the broad area called database marketing. It consists of analysis of customer databases to select the best potential customers for a particular product. Business Weekly estimated that more than fifty percent of all U.S. retailers use or plan to use database marketing. American Express has had good results from database marketing, experiencing a ten to fifteen percent increase in credit card use". [1]
2. Banking and computer security: "In June of 1994, a computer expert in St. Petersburg, Russia, Vladimir Leonidovich Levin penetrated the Citibank electronic funds-transfer network. Over the course of five months, he funneled 10 million dollars into accounts in California, Israel, Germany, Finland, the Netherlands and Switzerland. He was eventually apprehended, and most of the money was recovered, but the incident revealed the vulnerability of large databases to computer hackers [1]. Incidents such as the one described above make the security of a company’s computer system a serious issue in today’s business world. Although the raw information needed to detect an intrusion is often available in the audit data recorded by each computer, there is far too much of it generated each day for the system administrators and security officers to inspect it; scientific data mining algorithms are required to process this data [5]." [1]
3. Homeland security: “Data mining allows the automatic analysis of databases and the recognition of important trends and behavioral patterns. Probably the most important and pivotal technology for profiling terrorist and criminals via data mining is through the use of machine learning algorithms, to automate the manual process of searching and discovering key features and intervals. For example, they can be used to answer such questions as “when is fraud most likely to take place?” or, “what are the characteristics of a smuggler?”” [2]
4. Health care: "Automated surveillance systems raise citizen concern over privacy. But it is possible to detect events and monitor trends even after patient demographic and personally identifiable data are stripped out. Military medical researchers in the United States are using such a system that gathers data from military medical facilities worldwide as well as from other healthcare sources. The system detects outbreaks (eg, the Norwalk virus in San Diego in 2002) when individual healthcare practitioners may not be able to see the big picture, and it monitors progression of diseases (eg, West Nile virus and listeriosis)." [3]
5. Earth Science: “NASA will soon launch a satellite network devoted exclusively to earth science, the Earth Orbiting System (EOS). This complicated array of sensors will generate forty-six megabytes of data per second, which is almost four terabytes a day. For comparison, four terabytes is enough space to store fifteen hundred copies of the thirty two-volume text of the Encyclopedia Britannica [2]. So once NASA has all of this data, they will need some way to analyze it to produce useful conclusions, otherwise EOS will not be of any use to earth scientists. The only viable option for examining such a volume of information today is data mining algorithms”. [1]
6. Game Design: "Why Mine Data? Because players lie. Player feedback alone provides a poor diagnosis of game design. The picture a player's verbal feedback paints is not even an approximate guide. It is a distorted portrait of psychological and social forces. Players do not accurately report their own behavior in surveys or customer feedback. They may say one thing but do another instead. For example, anthropologist Dr. William Rathje surveyed the amount of beer people drank in a household and then went through their garbage. The garbage revealed twice as much consumption as the surveys had. This method was more insightful than surveys, which had been the traditional method of data collection. As psychological and social creatures, players, and developers, subconsciously revise their self-reports." [4]
7. Entertainment: "The BBC of the U.K. hired Integral Solutions Ltd. to develop a system for predicting the size of television audiences [1]. Integral Solutions Ltd.’s program used neural networks and rule induction to determine the factors playing the most important roles in relating the size of a program's audience to its scheduling slot. The final version performed as well as human experts but adapted more quickly to changes because it was constantly retrained with current data." [1]
Where can I get more information on data mining? - provides links to Web sites for current data mining products and ongoing research and development efforts
Contact information: Daniela Raicu, Associate Professor, CDM, DePaul University, Chicago, IL
Sources:
[1] The Emerging Field of Data Mining by Patrick Whalin (PDF file)
[2] Data Mining for Homeland Security by Jesus Mena (PDF file)
[3] Applications of Data Mining Techniques to Healthcare Data by Mary K. Obenshain (PDF file)
[4] Better Game Design Through Data Mining by David Kennerly