Data Mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make important business decisions (Connolly, 2004). This survey explores the concept of data mining and gives insight into the primary operations associated with its techniques: predictive modelling, database segmentation, link analysis, and deviation detection.
The concept of Data Mining keeps growing in popularity across business activity in general. We are living in an information era, and more and more data is being produced in every aspect of life you can think of. Each time you swipe your grocery card to get a discount on whatever products you buy, data is loaded into a database, and in most business transactions you make there is some sort of data capture. Organizations are storing, producing and analysing more data than at any moment in history, and this trend will continue to grow.
Data Mining incorporates mathematical methods that can include mathematical equations, algorithms, traditional logistic regression, neural networks, segmentation, classification, clustering, etc. These are all methods that rely on mathematics. Data Mining applies across industry sectors. Generally, wherever we have processes and wherever we have data, it is the application of these powerful mathematical techniques that extracts trends and patterns.
In this section I describe a few of the core Data Mining tasks. For each task I give an example to illustrate it. The tasks do not need to be used on their own; they can be combined to produce a more relevant output.
Data Mining tasks are divided into predictive tasks (classification, regression, and time series analysis) and descriptive tasks (clustering and association analysis):
"Classification maps data into predefined groups or classes" (Dunham, 2002). Those groups/classes are defined before the actual data analysis. A classic example of a classification application is deciding whether to authorize a credit card purchase.
The example below illustrates a classification problem:
Airport security screening points use pattern recognition systems to find potential criminals or terrorists. Those systems can scan anybody crossing the airport hall to identify his or her distinctive features (eyes, shape of the head, mouth size, etc.). Those features can be compared against many other patterns in the database to see whether any of them matches the scanned person.
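The matching step described above can be sketched as a nearest-neighbour classifier. This is only an illustrative sketch: the feature values, the class names "cleared" and "watchlist", and the helper `classify` are all invented for the example and do not describe any real screening system.

```python
import math

# Hypothetical labelled patterns: (feature vector, predefined class).
# The three features are made up for illustration
# (eye distance, head width, mouth width, all in cm).
KNOWN_PATTERNS = [
    ((6.2, 15.1, 5.0), "cleared"),
    ((6.4, 14.8, 5.2), "cleared"),
    ((5.1, 16.9, 4.1), "watchlist"),
    ((5.0, 17.2, 4.3), "watchlist"),
]


def classify(sample, patterns=KNOWN_PATTERNS):
    """Return the class of the stored pattern closest to `sample`.

    This is 1-nearest-neighbour classification using Euclidean distance.
    The classes exist before any sample is scanned, which is what makes
    this classification rather than clustering.
    """
    nearest = min(patterns, key=lambda p: math.dist(sample, p[0]))
    return nearest[1]


print(classify((5.2, 17.0, 4.0)))  # closest stored pattern is a "watchlist" one
print(classify((6.3, 15.0, 5.1)))  # closest stored pattern is a "cleared" one
```

A real system would use far more features and a trained model, but the principle is the same: a new observation is mapped to one of the predefined classes.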
Regression is a method that fits an equation to a given dataset, assuming that the data fits some kind of function, such as linear or logistic. Linear regression is the simplest form. It uses the straight-line formula (y = mx + b) and determines the values of m and b so that the value of y can be predicted from a given value of x.
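As a minimal sketch of how m and b can be determined, the code below uses ordinary least squares, which is one common way to fit the line (the text does not say which fitting method is meant). The function name `fit_line` and the sample data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Fit y = m*x + b to the points by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Made-up data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, b = fit_line(xs, ys)
prediction = m * 6 + b  # predict y for a new x = 6
print(m, b, prediction)
```

With real data the points will not lie exactly on a line, and m and b are then the values that minimise the sum of squared vertical distances to the line.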
With Time Series Analysis, the value of an attribute is analysed as it changes over time. The data is usually recorded at evenly spaced intervals (every second, minute, hour, day, etc.). An example of a time series would be the daily closing values of the gold price. Time series can be used to forecast events based on previously known data. An example of time series forecasting would be predicting the stock price of a given company based on its past performance.
Time Series: Daily gold value over the previous 8 months (Boursorama.com)
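A very simple way to forecast from past values is a moving average: the next value is estimated as the mean of the last few observations. This is only a naive baseline chosen for illustration, not a method named in the sources cited here, and the prices below are invented.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if window < 1 or len(series) < window:
        raise ValueError("need at least `window` observations")
    return sum(series[-window:]) / window


# Hypothetical daily closing values, recorded at evenly spaced intervals.
closing_values = [1210.0, 1215.5, 1208.0, 1220.0, 1226.0]

forecast = moving_average_forecast(closing_values, window=3)
print(forecast)  # mean of the last three values
```

Practical forecasting models also account for trend and seasonality, which a plain moving average ignores.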
Clustering is a method that seeks to divide instances into clusters that share similar characteristics. The goal is simply to explore the structure of the data, sorting it into groups (clusters) of similar items. "The higher the similarity within a group and the higher the difference between groups, the better will be the clustering." (Tan, 2006)
Different ways of clustering the same set of points: (a) original points; (b) two clusters; (c) four clusters
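One widely used clustering algorithm is k-means. The text does not name a specific algorithm, so k-means is used here only as an example. The sketch below takes the starting centroids as an argument to keep the result deterministic. The points and the function name `k_means` are assumptions for illustration.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) from the given starting centroids.

    Returns the final centroids and the clusters from the last assignment step.
    """
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up 2-D points forming two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]

centroids, clusters = k_means(points, centroids=[(1, 1), (9, 9)])
print(clusters[0])  # the three points near the origin
print(clusters[1])  # the three points near (8.5, 9)
```

Unlike classification, no classes are given in advance: the groups emerge from the data, and choosing a different number of starting centroids yields a different, equally valid grouping of the same points.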
Association Analysis is very important and one of the most used tasks within the Data Mining domain. It is very useful for discovering relationships among data and identifying specific types of associations. A very common application of association rules is analysing supermarket baskets to discover associations such as "customers who bought milk also bought cheese".
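The strength of a rule such as "milk implies cheese" is commonly measured with support (how often the items appear together) and confidence (how often the rule holds when its left-hand side is present). The sketch below computes both on a handful of invented baskets. The function names are assumptions for illustration.

```python
# Hypothetical supermarket baskets, one set of items per transaction.
baskets = [
    {"milk", "cheese", "bread"},
    {"milk", "cheese"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "cheese", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"milk", "cheese"}, baskets)          # 3 of 5 baskets
rule_confidence = confidence({"milk"}, {"cheese"}, baskets)  # 3 of the 4 milk baskets
print(rule_support, rule_confidence)
```

Here three of the four baskets that contain milk also contain cheese, so the rule holds for roughly three quarters of milk buyers.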
Data mining methodologies can be used in a number of different settings, such as manufacturing process control, fraud detection, risk factors in medical diagnosis, image recognition, and many others. Below are a few common domains where Data Mining can be applied:
Advertising: Whenever we speak about advertising and data, we think about Google. The search engine company works with data at the petabyte scale, and it uses a non-traditional way of organizing its data. Google applies mathematical models to enormous amounts of data, and it is without question one of the most profitable companies in the world. "Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising - it just assumed that better data, with better analytical tools, would win the day. And Google was right." (Anderson, 2008)
Shopping: Supermarkets have been keeping track of customer purchases for a long time, but only with the popularization of Data Mining could this data really be put to use. For example, Tesco Clubcard data can be analysed to predict what customers will buy, how they will pay, and even how many calories they will consume.
Education: Data Mining techniques can be applied in educational environments to analyse student learning behaviour and performance during the academic year, and even to predict how a student will perform in an exam.
Fraud Detection: Another relevant field of application, fraud detection affects different industries such as banking (credit card fraud detection, illegal transactions) and insurance (checking for false claims).
Risk Analysis: Risk analysis estimates the risks associated with future decisions. For example, a bank can build a predictive model, based on past observations, to determine whether it is appropriate to grant a mortgage to a person.
Text Mining: Text mining attempts to extract meaningful information from different kinds of text in order to classify documents, books, e-mails and web pages. A good example of a text mining application is the creation of filters for e-mail messages and newsgroups.
Image Recognition: Useful for recognizing characters, detecting human faces, and uncovering associations and anomalies. An example application is detecting suspicious behaviour through video camera monitoring.
Web Mining: Web mining applications are designed to analyse clickstreams, that is, the sequence of visits users make on websites. It is useful for analysing e-commerce websites, as it makes it possible to provide customized web pages for customers.
Tools for Data Mining are very powerful, but they require highly skilled specialists who are able to prepare the data and interpret the results.
Data Mining brings out the patterns and relationships, but the relevance and validity of these patterns must be judged by the user.
Like any technology, Data Mining has its pitfalls, with privacy and ethical concerns. There are several arguments about how privacy should be addressed. Some think that Data Mining is ethically neutral; however, the way Data Mining is being used nowadays raises many concerns, as advertising companies are buying data on customer spending and behaviour at the cost of reduced privacy.
There are many ways in which data mining can compromise privacy. To begin with, data mining requires extensive data preparation, which can reveal previously unknown information or patterns. For instance, many datasets from different sources can be put together for the purpose of analysis (called data aggregation). The threat arises when someone who has access to this data can identify or track down specific individuals.
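To make the aggregation threat concrete, the sketch below joins two small invented datasets on shared attributes (postcode and birth year). Neither dataset identifies a person's purchases on its own, but the join does. All records, field names, and the helper `link_records` are hypothetical.

```python
# Hypothetical "anonymised" purchase records: no names are stored.
purchases = [
    {"postcode": "AB1 2CD", "birth_year": 1980, "item": "insulin"},
    {"postcode": "EF3 4GH", "birth_year": 1975, "item": "wine"},
]

# Hypothetical public profile list from a different source.
profiles = [
    {"name": "Alice Example", "postcode": "AB1 2CD", "birth_year": 1980},
    {"name": "Bob Example", "postcode": "ZZ9 9ZZ", "birth_year": 1975},
]


def link_records(purchases, profiles, keys=("postcode", "birth_year")):
    """Attach a name to every purchase whose key attributes match a profile.

    Assumes each combination of key attributes is unique among the profiles.
    """
    index = {tuple(p[k] for k in keys): p["name"] for p in profiles}
    linked = []
    for record in purchases:
        name = index.get(tuple(record[k] for k in keys))
        if name is not None:
            linked.append({"name": name, **record})
    return linked


linked = link_records(purchases, profiles)
print(linked)  # only the first purchase matches a profile
```

The more attributes two sources share, the more likely it is that a combination of them singles out one individual.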
There are rising concerns about how much organisations know about our personal lives. For instance, by aggregating datasets from various sources, such as organisations and websites, someone could learn almost everything about your life: your full address, telephone number, age, how many cars you have and which ones, what type of house you live in, what you do, what you eat, what you drink, where you go, how much money you spend, your religion and beliefs, your likes and dislikes, etc. The list is endless. What could happen if that aggregated data fell into the wrong hands? The information we have been putting online could be used against us. For example, the US data mining industry has software that monitors social media on the internet, the so-called "Pre-Crimes", gathering "information about individuals that may ultimately enhance the American work environment into a hopeless get away from" (Burghart, 2010).
According to Burghart in his article in theSkyValleyChronicle.com, "Another company deploys an automation software that slogs through Facebook, Twitter, Flickr, YouTube, LinkedIn, personal blogs, and a large number of other sources, to build up a written report on the 'real you' -- not the carefully crafted you in your resume."
Another recent problem occurred when the personal details of 100 million Facebook users were collected and distributed online. In my opinion this is just the start of a much bigger problem that will develop over time.
"The Data Mining idea will grow in popularity, because data continues to grow. Think about social networking, such as Twitter and Facebook. It is data that describes people, what they do, and what they are. Data is produced when you buy or sell, and even when you go to work: data is being loaded whenever you swipe your Oyster card in the underground system. More and more we have data gathering and data capturing, and that is just how it is in this information economy. The question is how best to extract strategic information from that data, those data resources. This is Data Mining." (Dalio, 2010) - http://www.telegraph.co.uk/technology/7963311/10-ways-data-is-changing-how-we-live.html