CSC367: Introduction to Data Mining

 

Course general information

Quarter: Fall 2009

Lecture time: Thursday, 1:30-4:45pm

Lecture location: Loop campus

 

Course syllabus - click here for the syllabus, which covers the course objectives, textbook and topics, assignments and final project, grading, and class policies.

 

What is data mining? Generally, data mining (sometimes called data or knowledge discovery) is the process of extracting meaningful, non-trivial, and hidden information from large collections of data. Once domain experts validate that information, it becomes knowledge. Extracting nuggets of information from mountains of data involves techniques from several fields, such as statistics, machine learning, artificial intelligence, and visualization.

 

Where and how can data mining be used? Given the tremendous amount of data generated by advances in computer technology, data mining is needed to extract information (patterns, trends, correlations, etc.) in many domains, such as:

 

1. Marketing: "Data mining in marketing falls into the broad area called database marketing. It consists of analysis of customer databases to select the best potential customers for a particular product. Business Weekly estimated that more than fifty percent of all U.S. retailers use or plan to use database marketing. American Express has had good results from database marketing, experiencing a ten to fifteen percent increase in credit card use." [1]

 

2. Banking and computer security: "In June of 1994, a computer expert in St. Petersburg, Russia, Vladimir Leonidovich Levin penetrated the Citibank electronic funds-transfer network. Over the course of five months, he funneled 10 million dollars into accounts in California, Israel, Germany, Finland, the Netherlands and Switzerland. He was eventually apprehended, and most of the money was recovered, but the incident revealed the vulnerability of large databases to computer hackers [1]. Incidents such as the one described above make the security of a company’s computer system a serious issue in today’s business world. Although the raw information needed to detect an intrusion is often available in the audit data recorded by each computer, there is far too much of it generated each day for the system administrators and security officers to inspect it; scientific data mining algorithms are required to process this data [5]." [1]

 

3. Homeland security: Data mining allows the automatic analysis of databases and the recognition of important trends and behavioral patterns. Probably the most important and pivotal technology for profiling terrorists and criminals via data mining is the use of machine learning algorithms to automate the manual process of searching and discovering key features and intervals. For example, they can be used to answer such questions as “when is fraud most likely to take place?” or “what are the characteristics of a smuggler?” [2]

 

4. Health care: "Automated surveillance systems raise citizen concern over privacy. But it is possible to detect events and monitor trends even after patient demographic and personally identifiable data are stripped out. Military medical researchers in the United States are using such a system that gathers data from military medical facilities worldwide as well as from other healthcare sources. The system detects outbreaks (eg, the Norwalk virus in San Diego in 2002) when individual healthcare practitioners may not be able to see the big picture, and it monitors progression of diseases (eg, West Nile virus and listeriosis)." [3]

 

5. Earth Science: "NASA will soon launch a satellite network devoted exclusively to earth science, the Earth Orbiting System (EOS). This complicated array of sensors will generate forty-six megabytes of data per second, which is almost four terabytes a day. For comparison, four terabytes is enough space to store fifteen hundred copies of the thirty-two-volume text of the Encyclopedia Britannica [2]. So once NASA has all of this data, they will need some way to analyze it to produce useful conclusions, otherwise EOS will not be of any use to earth scientists. The only viable option for examining such a volume of information today is data mining algorithms." [1]

 

6. Game Design: "Why Mine Data? Because players lie. Player feedback alone provides a poor diagnosis of game design. The picture a player's verbal feedback paints is not even an approximate guide. It is a distorted portrait of psychological and social forces. Players do not accurately report their own behavior in surveys or customer feedback. They may say one thing but do another instead. For example, anthropologist Dr. William Rathje surveyed the amount of beer people drank in a household and then went through their garbage. The garbage revealed twice as much consumption as the surveys had. This method was more insightful than surveys, which had been the traditional method of data collection. As psychological and social creatures, players, and developers, subconsciously revise their self-reports." [4]

 

7. Entertainment: "The BBC of the U.K. hired Integral Solutions Ltd. to develop a system for predicting the size of television audiences [1]. Integral Solutions Ltd.’s program used neural networks and rule induction to determine the factors playing the most important roles in relating the size of a program's audience to its scheduling slot. The final version performed as well as human experts but adapted more quickly to changes because it was constantly retrained with current data." [1]
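
As a quick sanity check on the Earth Science example (item 5), the quoted rate of forty-six megabytes per second can be converted to a daily volume. The sketch below is only a back-of-the-envelope calculation, not part of the cited source; it assumes decimal units (1 terabyte = 1,000,000 megabytes), and the function name terabytes_per_day is chosen here for illustration.

```python
# Back-of-the-envelope check of the data rate quoted in item 5.
# Assumption (not stated in the source): decimal units,
# i.e. 1 terabyte = 1,000,000 megabytes.

SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds in a day
MB_PER_TB = 1_000_000            # decimal megabytes per terabyte


def terabytes_per_day(mb_per_second: float) -> float:
    """Convert a sustained rate in MB/s into terabytes per day."""
    return mb_per_second * SECONDS_PER_DAY / MB_PER_TB


if __name__ == "__main__":
    rate = 46  # megabytes per second, as quoted in item 5
    print(f"{rate} MB/s is about {terabytes_per_day(rate):.2f} TB per day")
```

Under that assumption the result is a little under four terabytes per day, which is consistent with the "almost four terabytes a day" figure in the quotation.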

Where can I get more information on data mining? - links to Web sites for current data mining products and ongoing research and development efforts

 

Contact information: Daniela Raicu, Associate Professor, CDM, DePaul University, Chicago, IL

 

Sources:

 

[1] The Emerging Field of Data Mining by Patrick Whalin (PDF file)

[2] Data Mining for Homeland Security by Jesus Mena (PDF file)

[3] Applications of Data Mining Techniques to Healthcare Data by Mary K. Obenshain (PDF file)

[4] Better Game Design Through Data Mining by David Kennerly