Trying To Think

Monday, January 31, 2005

Data Mining

*Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations*.

Witten and Frank, QA76.9.D343.W58/2000

Data Mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. And real data is imperfect: some parts are garbled, some missing. (pg xix)

Useful patterns allow us to make non-trivial predictions on new data. (pg3)

One way of visualizing the problem of learning - and one that distinguishes it from statistical approaches - is to imagine a search through a space of possible concept descriptions for one that fits the data. (pg27)

Viewing generalization as a search in a space of possible concepts makes it clear that the most important decisions in a machine learning system are:

- the concept description language

- the order in which the space is searched

- the way that overfitting to the particular training data is avoided

These three properties are generally referred to as the *bias* of the search and are called **language bias**, **search bias**, and **overfitting-avoidance bias**. (pg29)

When building a decision tree, a commitment to split early on using a particular attribute might turn out later to be ill-considered in the light of how the tree develops below that node. To get around these problems, a "beam search" could be used where irrevocable commitments are not made, but instead a set of several active alternatives - whose number is the "beam width" - are pursued in parallel. (pg31)
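The beam-search idea above can be sketched generically. This is my own minimal illustration, not code from the book: `expand` and `score` are assumed, caller-supplied functions, and the toy usage just searches for the bit string with the most 1s.

```python
def beam_search(start, expand, score, beam_width=3, steps=5):
    """Generic beam search: instead of committing irrevocably to the single
    best candidate at each step, keep the `beam_width` best alternatives
    alive and expand them all in parallel."""
    beam = [start]
    for _ in range(steps):
        # generate every successor of every candidate currently in the beam
        candidates = [succ for cand in beam for succ in expand(cand)]
        if not candidates:
            break
        # prune back down to the beam_width most promising candidates
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)

# toy usage: grow bit strings, scoring by how many 1s they contain
best = beam_search((), lambda t: [t + (0,), t + (1,)], sum,
                   beam_width=2, steps=4)
```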

If disjunction is allowed, useless concept descriptions that merely summarize the data become possible, whereas if it is prohibited, some concepts are unlearnable. To get around this problem, it is common to search the concept space starting with the simplest concept descriptions and proceeding to more complex ones later: **simplest-first ordering**. (pg32)

Neural Networks: Bigus, 1996

Genetic Algorithms: Goldberg, 1989

Four basically different styles of learning appear in data mining applications. In **classification learning**, a learning scheme takes a set of classified examples from which it is expected to learn a way of classifying unseen examples. In **association learning**, any association between features is sought, not just ones that predict a particular class value. In **clustering**, groups of examples that belong together are sought. In **numeric prediction**, the outcome to be predicted is not a discrete class but a numeric quantity. (pg38)

Inductive logic programming: Bergadano and Gunetti, 1996

**Instance-based representation**: "just store the instances themselves and operate by relating new instances whose class is unknown to existing ones whose class is known. Instead of trying to create rules, work directly from the examples themselves .. instance-based learning is lazy, deferring the real work for as long as possible .. each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one. This is called the nearest-neighbour classification method" (pg72)
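The nearest-neighbour method described above is short enough to sketch directly. This is my own illustration (the book doesn't give this code); it assumes numeric feature vectors and uses Euclidean distance as the metric.

```python
import math

def nearest_neighbour(training, new_instance):
    """Nearest-neighbour classification: compare the new instance with every
    stored instance using a distance metric (Euclidean here) and return the
    class of the closest one.  `training` is a list of
    (feature_vector, class) pairs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(training, key=lambda pair: dist(pair[0], new_instance))[1]

# toy usage: two stored instances, one per class
train = [((1.0, 1.0), "a"), ((5.0, 5.0), "b")]
predicted = nearest_neighbour(train, (1.5, 1.2))  # closest to (1.0, 1.0)
```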

Comparing against all instances is slow, so "deciding which instances to save and which to discard is another key problem in instance-based learning ... when training instances are discarded, the result is to save just a few prototypical examples of each class ... these prototypical examples serve as a kind of explicit knowledge representation" (pg73-4)

Other algorithms provide a hierarchical structure of clusters, so that at the top level the instance space divides into just a few clusters, each of which divides into its own subclusters at the next level down, and so on. ... diagrams like these are called dendrograms. This term means just the same thing as tree diagram. (pg76)

**Chapter 4 – Basic Algorithms**

In the infinite variety of possible datasets there are many different kinds of structure that can occur, and a data mining tool – no matter how capable – that is looking for one class of structure may completely miss regularities of a different kind, regardless of how rudimentary those might be. The result: a baroque and opaque classification structure of one kind instead of a simple, elegant, immediately comprehensible structure of another. (pg78)

**1R**(“one rule” - pg78ff): for each attribute, get the majority class for each value. Count the errors for each attribute (i.e. the number of values that did not get the majority class). The rules for the attribute with the least errors become the rules we use. Missing values are just another attribute value, and numeric values can be partitioned into ranges based on class changes, with a minimum number in each partition to avoid overfitting.
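The 1R procedure above is simple enough to write out. This is my own sketch for the nominal-attribute case (it omits the missing-value and numeric-partitioning details); the weather-style toy data is made up for illustration.

```python
from collections import Counter, defaultdict

def one_r(instances, classes):
    """1R: for each attribute, build a rule that predicts the majority class
    of each of its values; keep the attribute whose rules make the fewest
    errors.  `instances` is a list of dicts: attribute name -> value."""
    best_attr, best_rules, best_errors = None, None, None
    for attr in instances[0]:
        # count the classes seen for each value of this attribute
        counts = defaultdict(Counter)
        for inst, cls in zip(instances, classes):
            counts[inst[attr]][cls] += 1
        # majority class per value; errors = instances not in the majority
        rules = {val: c.most_common(1)[0][0] for val, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values())
                     for c in counts.values())
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

# toy data: outlook alone classifies every instance correctly, so it wins
data = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}]
labels = ["play", "play", "stay", "stay"]
attr, rules = one_r(data, labels)
```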

**Naïve Bayes** (pg82ff): For each attribute value, work out the % chance of each class. For new instances, use these % values to work out the most likely class. "Naïve" assumes that all the attributes are equally important and independent. Adding the "Laplace estimator" handles attribute values that would otherwise produce zero probabilities. Assigning an "a priori" initial probability for each attribute removes the assumption that the attributes are equally important, producing a full Bayes (but working out what the initial values should be is difficult, and they don't make much difference for large datasets).
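A minimal sketch of the nominal-attribute case, with the Laplace estimator folded in. This is my own illustration, not the book's code; it works in log space to avoid underflow, and reuses the made-up weather-style data from above.

```python
import math
from collections import Counter, defaultdict

def train_nb(instances, classes, laplace=1.0):
    """Tabulate class counts and per-class attribute-value counts.  The
    Laplace estimator is applied at classification time so that values
    unseen for a class never contribute a zero probability."""
    class_counts = Counter(classes)
    value_counts = defaultdict(Counter)   # (attr, class) -> value counts
    values = defaultdict(set)             # attr -> all observed values
    for inst, cls in zip(instances, classes):
        for attr, val in inst.items():
            value_counts[(attr, cls)][val] += 1
            values[attr].add(val)
    return class_counts, value_counts, values, laplace

def classify_nb(model, instance):
    """Pick the class maximising log P(class) + sum of log P(value|class)."""
    class_counts, value_counts, values, laplace = model
    total = sum(class_counts.values())
    best_cls, best_score = None, float("-inf")
    for cls, n in class_counts.items():
        score = math.log(n / total)
        for attr, val in instance.items():
            num = value_counts[(attr, cls)][val] + laplace
            den = n + laplace * len(values[attr])
            score += math.log(num / den)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# toy usage (same made-up data as the 1R example)
data = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}]
labels = ["play", "play", "stay", "stay"]
model = train_nb(data, labels)
predicted = classify_nb(model, {"outlook": "sunny", "windy": "yes"})
```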

Numeric values are handled by calculating the mean and standard deviation for each class and attribute, and then calculating the probability density function (pg86-7).
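The numeric case reduces to one formula. A sketch of my own, with made-up temperature readings: fit a mean and standard deviation per class, then use the normal probability density function in place of the value-count table.

```python
import math
import statistics

def gaussian_pdf(x, mean, std):
    """Probability density function of a normal distribution, used as the
    per-class likelihood of a numeric attribute value."""
    return (math.exp(-((x - mean) ** 2) / (2 * std ** 2))
            / (std * math.sqrt(2 * math.pi)))

# e.g. temperatures observed for one class (made-up numbers)
temps = [64, 68, 70, 72, 75]
mean, std = statistics.mean(temps), statistics.stdev(temps)
density = gaussian_pdf(66, mean, std)  # density, not a probability: can be > 1
```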

Problems: must be careful to choose independent attributes; must choose the correct distribution for numeric data (i.e. might not be normally distributed), or use “kernel density estimation”, or discretize the data first.

**Decision Trees**: select an attribute, and create a branch for each of its values. Then, using another attribute, on each branch create branches for those instances that reach that branch. Continue until all instances reaching a branch would have the same classification.

We select first the attributes that create the purest branches (i.e. each branch contains the smallest number of different classifications).

*Purity* is measured by information (which has units of bits): the amount of information needed to classify an instance at that node. Quinlan's C4.5 is discussed in Chapter 6.
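The information measure has a one-line formula. A sketch of my own: the entropy of a node's class counts, in bits; a pure node needs 0 bits, a 50/50 two-class node needs 1.

```python
import math

def info(class_counts):
    """Information (entropy, in bits) of a node holding the given class
    counts: the expected number of bits needed to classify one instance
    drawn from that node."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c)

info([2, 2])   # maximally impure two-class node: 1.0 bit
info([4, 0])   # pure node: 0.0 bits
```

Attribute selection then compares the node's information with the weighted average information of the branches a split would create, preferring the attribute with the biggest drop.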

Note - this talk about information is similar to information theory (Dretske's book)

While decision trees use *divide-and-conquer* (seek the attribute that best splits the data into classes), the opposite approach is a *covering approach*: take each class in turn and seek a way of covering all instances in it, while excluding instances that aren't in the class (-> a set of rules). We start with a rule that maximises the probability of getting the class right, and add rules until we have the desired tolerance. We then continue to add rules for each class.