Blog - Science, models and machine learning


  • Options

    Hi everyone. My approach to writing articles is, once I've got a list of ideas, to write some sketchy text to see what flows into what. (It's the only way I can find of writing that doesn't result in lots of forward references and other problems.) So it's EMPHATICALLY JUST A SKETCH (THAT WILL BE FLESHED OUT) at the moment. However, does anyone think there are any other basic machine learning concepts that will be relevant to the El Nino stuff that ought to go in this "intro to fundamental concepts" article? I can't think of any myself, but I'd welcome any input.

  • Options
    edited August 2014

    I'm trying to learn about Hessian-free networks:

    and a very literate blog post:

  • Options

    David it seems you are editing the new wiki, which is supposed to be overwritten soon (?? at least that's what I have understood) by the old wiki.

  • Options
    edited August 2014

    I believe the overwriting already happened Monday night/Tuesday morning.

  • Options

    The page looks very good for a SKETCH.

    There is a basic concept that I think is missing: a discussion of loss/utility functions. When we make predictions, it is so we can take actions, even if it is just cancelling a picnic due to a forecast of rain. These actions have costs and benefits depending on what kind of errors are involved, and can change the optimum threshold for a classifier, for example. (You know this of course, I am suggesting some ideas on how to explain it.)
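For concreteness, here's a toy sketch of the threshold point (the numbers and the cost model are entirely made up by me, not from the article): when getting rained out is much costlier than a needless cancellation, it becomes optimal to cancel even at a low forecast probability of rain.

```python
# Toy sketch: asymmetric costs move the optimum decision threshold.
def expected_cost(p_rain, threshold, cost_cancel, cost_rained_out):
    """Expected cost of the policy 'cancel the picnic if p_rain >= threshold'."""
    if p_rain >= threshold:
        return cost_cancel            # we cancel, paying the fixed cost
    return p_rain * cost_rained_out   # we go; pay only if it actually rains

# With a 30% rain forecast and a rained-out picnic 10x worse than cancelling,
# a low threshold (cancel readily) beats a high one (never cancel).
cautious = expected_cost(0.3, 0.2, cost_cancel=1, cost_rained_out=10)
reckless = expected_cost(0.3, 0.9, cost_cancel=1, cost_rained_out=10)
print(cautious, reckless)  # 1 vs 3.0: cancelling is the cheaper policy here
```

The same classifier, with the same predicted probabilities, thus gets a different optimum threshold as soon as the error costs change.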

    It should also be pointed out that there's not a fundamental distinction between machine learning and statistics.

    I think statistics should be statistical inference.

    Of course, loss/utility functions are subjective, as exemplified by Half man Half Biscuit's A Shropshire Lad:

    Second greatest time I had
    Was when they asked me and my Dad
    To organise a festival
    Along the lines of Donington
    We took Chirk Airfield as our site
    Booked the bands we thought were right
    Received the long-range from the Met
    They said it could be very wet
    With this in mind, we thought it wise
    To call the whole caboodle off
    The greatest time I ever had
    Was when we didn’t tell the bands
    Boom boom boom
    Let me hear you say
    Hosepipe ban
  • Options

    David Tweed wrote:

    I believe the overwriting already happened Monday night/Tuesday morning.

    Yes. The new wiki is open for business. We're just waiting for the name change to propagate.

  • Options

    Thanks for the comments. I've incorporated Graham's suggestions. While Jim's is very interesting I also think it's a much more advanced topic that deserves its own post (maybe from Jim). I've done some tweaking & typo correction. The other thing I've done is add wikipedia links for a lot of the terms used.

    The big remaining problem is the initial car example. Ideally I'd like something as easy to understand with a weird effect that's easy to model but difficult to explain which only uses physics. However, I'm drawing a blank on what that could be, so it may be best to just stick with the existing example.

    I think all the key content is there, but it does need some more polishing of the wording.

  • Options

    So I've reached the point where I can't think of anything else I want to do to the article. If anyone has any comments/changes let me know and I'll try to address them, but otherwise I think it's finished. (From an "interesting layout" perspective it could do with more images, but I can't think of any more relevant ones to include.)

  • Options
    edited August 2014

    Nice blog article, surveying a broad sweep of ideas. The writing tone is also crisp and good.

    I have some questions/comments on the specifics, which I'll put in some followup posts.

  • Options
    edited August 2014

    In the section called "Training sets and test sets" you talk about the training set, the test set, and an optional validation set. Can you say what the validation set is for?

  • Options
    edited August 2014

    In the discussion of sparsity of models, an example to concretize the ideas would be helpful.

    Aren't attempts to make the models sparse biases introduced by the researchers? I understand the issues of computational complexity, etc., but one has to be careful about compromising models for reasons that have nothing to do with the subject matter.

    On the other hand, one could argue that any model is a simplification, and hence compromises part of the truth in order to be understandable or computable by us.

    But still I feel that one needs to be careful about how and why models are compromised.

    For instance, there are all kinds of models for the semantics of logic programs that involve negation. The purely declarative interpretation of such logic programs will not in general have a unique minimal model. And answering queries using the semantics of logical consequence is not computationally feasible. In response, there's a whole literature devoted to different ways of choosing "the" model for the logic program, and using this to answer queries. In this pursuit, the complexity of computing the model is an important factor. But what's the point of these variant notions of truth and consequence? If it's because they are of interest in themselves that's one thing, but if it's because the thing we want is uncomputable, that's another. If the latter, then I'd rather go through the stages of mourning and acceptance, and then move on to other pursuits which are feasible.

  • Options
    edited August 2014

    WRT sparsity: If you read papers around the LASSO, etc, you see a general theme that putting in a sparseness prior discourages the model from using randomly occurring correlations between "not actually relevant" inputs and outputs to improve its overall score when fitting in a way that improves model generalization. Of course it's a matter of degree: putting a hugely weighted sparseness prior is likely to bias things, but it does seem like a moderate sparseness prior is helpful.
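As a toy sketch of that point (my own made-up data, not anything from the article): a moderate $l_1$ penalty, minimised here by a simple proximal-gradient (ISTA) loop, drives the coefficients of the irrelevant inputs to (near) zero while keeping the two real ones.

```python
import numpy as np

# Solve  min_w (1/2n)||Xw - y||^2 + lam*||w||_1  by proximal gradient (ISTA).
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:2] = [3.0, -2.0]            # only the first two features matter
y = X @ true_w + 0.1 * rng.normal(size=n)

def ista(X, y, lam, lr=0.01, steps=5000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)            # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # l1 prox
    return w

w_hat = ista(X, y, lam=0.1)
print(np.round(w_hat, 2))  # the 8 irrelevant coefficients shrink to ~0
```

Cranking `lam` way up would start biasing the two genuine coefficients too, which is the "matter of degree" point above.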

    Regarding an example, on the one hand it'd be good but I'd have to introduce a model and then add a sparsity prior to it, so I'm not sure if it'll grow too big. I'll have a think...

  • Options

    It's true that you are dealing with a wide swath of material, and so size is an issue. It already feels on the brink of being large enough to be split into two blogs (though I'm not sure along which lines). The examples could push it further across that threshold -- which could be a good thing.

  • Options
    edited August 2014

    To exacerbate the problem, I think the paragraph on latent variables could use an example.

  • Options

    There are a lot of ideas involved in the section "Inferring a physical model from a ML model," and I don't think they can be clearly addressed in such a small number of paragraphs -- you're introducing the whole idea of data mining. Perhaps this section could be lifted out, and put into a second blog article?

  • Options

    A point about terminology.

    In the introduction you make a good, clarifying distinction between "data driven" and "physical" models.

    I think that the counterpart to data-driven models includes physical models, but is broader than that. For instance, consider a predictive model for a game of blackjack, where the model is used as part of a program that plays in a game with specific, known opponents.

    The input to the program may contain rules such as: Jim will always hold if his hand exceeds 17. On the basis of such rules, and the observable portion of the game, the model makes ongoing predictions about the probabilities of various events.

    I wouldn't call these rules "physical" (one could stretch the term, but it becomes questionable), but they are part of the opposite of data-driven models.

    What about theory-driven models versus data-driven models?

  • Options

    Regarding theory vs physical, I'm not sure. The article isn't an attempt to be a general machine learning article but to state some of the things it would be helpful to refer to from other El Nino articles. (I'm too lazy to write Nino correctly on the forum...) That argues physical models are most appropriate, but then there's the sore thumb of the car example. So maybe theory driven is better.

    (Incidentally I've started adding a sparsity example to the text, but I'll wait until you've finished editing to do more.)

  • Options
    edited August 2014

    The text says:

    We also stress there's not a fundamental distinction between machine learning and statistical modeling and inference.

    But the link labeled "statistical modeling and inference" leads to a page on statistical learning theory.

    There are good connections here, but I wouldn't say that there's no distinction between machine learning and statistical modeling, or even statistical modeling and inference.

    The connection that I see -- after reading your paper -- is that machine learning can be brought under the general umbrella of statistical modeling. Every statistical model is based on some kind of rules (or knowledge). It's machine learning when the rules are derived from the data. Otherwise the rules are based on a human-produced theory. I.e. data-driven modeling versus theory-driven modeling.

    p.s. You may have noticed that I had the edit lock on the page for a little while. That was just so I could see the source text for the link.

  • Options

    A stickler could say that even machine learning is theory-based, it's just that the theory is learned. But the distinction is what drives the model -- are the rules given as input, or inferred from data?

  • Options

    Re 19:

    The wikipedia link is to the closest concept I could find, but you're right it's not a perfect match. If there's anything better...

    On the more general point, I'm basically trying to dispel the myth that "cool" machine learning people do stuff like neural nets and decision trees (which must thus be cool) while "boring" statisticians do stuff like regression and discriminant analysis (which must thus be boring). You can find good, interesting papers by people from both departments working on all kinds of techniques. (I do have a vested interest given I'm looking at regression at the moment.) For example, which class does a graphical model belong to? It really seems to me to be equally studied in "machine learning labs" and "statistics departments".

    I still think there's not a fundamental distinction in the kind of things being done between the two.

  • Options

    Re #20: I think there are a lot of things that people think of as "statistical" that have no "theory about the problem domain" in them, eg, PCA, linear discriminant analysis, regression. Even saying they're thought of as statistical is coming down a bit too much on one side: I'm pretty sure virtually every introductory machine learning course would cover them as well.

    BTW: if you just want to see the page source, you can click on the word "Source" in "Views:" towards the end of the line at the very bottom of a page on the wiki.

  • Options

    I've changed the "statistical modelling and inference" link to a more appropriate one, added an incredibly brief explanation of the validation set and completed the sparse prior example.

  • Options
    edited August 2014

    Sorry to have been so silent - finishing a book will do that to you!

    Here's one comment... my wife just got back home and it's dinnertime so I'll have to continue tomorrow:

    The most common way to work with collected data is to split it into a training set and a test set. (Sometimes there is a division into three sets, the third being a validation set which is used when the model has meta-parameters as a test set to find the best values for them.) The training and validation sets are used in the process of determining the best model parameters, while the test set – which is not used in any way in determining the best model parameters – is then used to see how effective the model is likely to be on new, unseen data.

    It's a bit confusing to introduce the third, optional "validation set" before you've said what the "training set" and "test set" are. You introduce this "validation set" and say it's "used as a test set" before you even say what a test set is. Then you say "the training and validation sets are..." giving some feature common to both those two, even though you just said the validation set is used as a test set, making it sound like those two are similar. Then you say something about the test set.

    Don't feel bad, this sort of convoluted circular exposition is typical of people trying to explain things they understand too well! It always helps to replace the terms one is trying to explain by meaningless symbols like X, Y and Z, since they'll be meaningless to the people reading the explanation:

    The most common way to work with collected data is to split it into a X and a Y. (Sometimes there is a division into three sets, the third being a Z which is used when the model has meta-parameters as a Y to find the best values for them.) The X and Z are used in the process of determining the best model parameters, while the Y – which is not used in any way in determining the best model parameters – is then used to see how effective the model is likely to be on new, unseen data.

    I recommend something more like this:

    The most common way to work with collected data is to split it into a training set and a test set. The training set lets us choose the best model parameters. The test set – which is not used in any way in determining the best model parameters – is then used to see how effective the model is likely to be on new, unseen data. (Sometimes there is a more elaborate division into three sets, the third being a validation set which is used when the model has "meta-parameters", in order to... something or other, preferably not including the phrase "test set".)

    It would be great to say what a "meta-parameter" is... or simply leave out this "validation set" stuff, which may be too much for a beginner who is trying to absorb all this information for the first time. I don't really know how a "meta-parameter" differs from a parameter, though I know that most kinds of abstract things come along with "meta-things".

    Anyway, it looks great as far as I've gotten. More later!

  • Options
    edited August 2014

    I think I'll reword it to make it clear that there are approaches that use more than just a training and test set without actually mentioning a validation set as an example (so I don't have to explain it further).

    Just for this conversation (not the blog), later in the article there's an example of regression using an $l_1$ prior. The coefficients $c_i$ are optimizable directly from the training data, so they're parameters. However $\lambda$ you can't meaningfully optimize at the same time; you have to do that by testing a whole fitted classifier on some different data (the validation set), so it's a meta-parameter.
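A throwaway sketch of that split (my own toy numbers, and a ridge penalty rather than the article's $l_1$ prior, just to keep the fitting step to one line): the parameters $w$ are fit on the training set; the meta-parameter $\lambda$ can't be optimized the same way (since $\lambda \to 0$ always fits the training data best), so whole fitted models are scored on the validation set; the test set is only touched at the end.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, 0.0, 0.0, 0.0, -1.0]) + 0.5 * rng.normal(size=300)

X_tr, y_tr = X[:200], y[:200]        # fit parameters here
X_va, y_va = X[200:250], y[200:250]  # pick the meta-parameter here
X_te, y_te = X[250:], y[250:]        # untouched until the final report

def fit(X, y, lam):                  # ridge regression, closed form
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

best_lam = min([0.01, 1.0, 100.0],
               key=lambda lam: mse(fit(X_tr, y_tr, lam), X_va, y_va))
w = fit(X_tr, y_tr, best_lam)
print(best_lam, mse(w, X_te, y_te))  # generalization estimate comes last
```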

    Changes should be done when you start re-editing tomorrow.

  • Options
    edited August 2014

    Okay, great! I'm glad you're not explaining validation. I've hardly ever heard anyone complain an exposition didn't cover enough material. People love a clear explanation of a small set of concepts.

  • Options
    edited August 2014

    Something scary is happening on the wiki: I see the dollar signs surrounding math symbols as dollar signs. Does this mean TeX isn't working on the new wiki, or what?

    Also: I can't edit the new wiki since my IP address is blocked:

    Access denied. Your IP address,, was found on one or more DNSBL blocking list(s).

    This is a National University of Singapore IP address. I can get around this problem with a VPN. This sort of problem seems to come and go...

  • Options
    edited August 2014

    My experiences have been that if you do "dollars without spaces" the new wiki does math correctly. If I do the "prep for wordpress" thing of leaving spaces it seems like they don't work. I have no idea why!

    BTW: edit regarding validation set issue now done.

  • Options
    edited August 2014

    Okay. I don't recall asking people to "prep for wordpress" by leaving spaces. If one wants to prep for the wordpress blog, it would be much better to surround math text by something like \( \) where there's a distinction between the left symbol and the right symbol... since then I can use a global search-and-replace to convert \( to $latex and \) to $.

    It's a real defect in old TeX that there's no visible distinction between a "left dollar sign" and a "right dollar sign". Knuth, as a computer scientist, should have known that a language where left and right parentheses were the same would be very bad!
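To illustrate, a throwaway sketch of that search-and-replace (assuming \( \) delimiters; the function name is mine): because the left and right markers are distinct, a blind global substitution into WordPress's "$latex ... $" form is safe.

```python
# Convert \( ... \) math delimiters to WordPress's $latex ... $ form.
def wiki_to_wordpress(text):
    text = text.replace(r"\(", "$latex ")
    text = text.replace(r"\)", "$")
    return text

print(wiki_to_wordpress(r"Euler: \(e^{i\pi}+1=0\), done."))
# -> Euler: $latex e^{i\pi}+1=0$, done.
```

With plain $ delimiters this trick is impossible, which is exactly the defect complained about above.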

  • Options

    But anyway, don't worry about this dollar stuff for the purposes of this blog article... though it makes me worry about the new wiki.

  • Options
    edited August 2014

    Regarding the blog article and the dollar signs: I think it all stems back to me using html paragraph markers on each paragraph. However I've just looked at the source for some of the other blog articles and see I didn't need to do that. Maybe getting rid of the paragraph markers is simplest (but I'll leave that decision and action for you).

    EDIT: I've dropped the paragraph markers around the two paragraphs involving mathematics but left the rest in place.

  • Options
    edited August 2014

    I've deleted all the paragraph markers, since it wasn't hard to do and they aren't needed.

    I'm making a bunch of small changes which you might want to review, but here are my main questions (or requests):


    1)

    An extreme example of preprocessing is explicitly forming new features in the data.

    For some reason, "forming new features" seems a bit jargonesque to me, perhaps because it has a plain English meaning which isn't what you mean. "Forming a new feature" sounds like growing a nose, or the formation of a volcano, or creating something that wasn't there. But I guess what it means is "computing a complicated function of the data, which lets us discern important features that were hiding in it."

    So maybe a word of explanation would be nice. You illustrate it with an example that will make sense to people who have been carefully following the El Niño series, but a simple definition would be nice first - like:

    An extreme example of preprocessing is explicitly 'forming new features' in the data: that is, computing a function of the data that lets us discern important features.

    (It's nice to teach people the jargon of the trade, so I'm not suggesting that you get rid of this term.)


    2)

    One factor that's often related to generalization is regularization and in particular sparsity.

    I'm glad you're bold-facing defined concepts, but you never really define regularization here. It would be good to either define it or skip it; some readers may click on the link but not many. You only really discuss sparsity.

    I use single quotes for terms to indicate "you may not know what this means yet, but don't worry, I don't expect you to know what it means". So if you don't want to explain regularization you can say

    One factor that's often related to generalization is 'regularization' and in particular sparsity. (then, explanation of sparsity)

    3) You also don't explain "identifiability", but I just put single quotes around this, and it's already in a parenthetical remark so people will know they can skip this remark if they want.

    That's it! All in all, a remarkably clear and inviting introduction to some big ideas! Thanks!

  • Options
    edited August 2014

    Thanks. I've made changes for 1 and 2, and think your fix for 3 works.

  • Options

    Great! I'll post this after David Tanzer's new post has had several days to shine.

    Right now hits on the blog have soared thanks to his post. Most of them are coming from Twitter, for some reason.

  • Options

    This is a bit of a rant about the use of the word model:

    ... ‘machine learning models’, but could equally have used ‘statistical models’.


    a model is any systematic procedure ... providing a prediction

    I don't like this usage of the word model. I don't think it makes sense to call a procedure a model. You are stretching the word so that it includes the method of inference/estimation along with what I think should be called a model. You are far from the only person who does this, and maybe it's too late to do anything about it, but I really wish people wouldn't!

    Wikipedia has several meanings for the word, including:

    • Model (economics), a theoretical construct representing economic processes

    • Model (physical), a smaller or larger physical copy of an object

    • Scale model, a replica or prototype of an object

    • Computer model, a simulation to reproduce behavior of a system

    • Conceptual model (computer science), representation of entities and relationships between them

    • Mathematical model, a description of a system using mathematical concepts and language

    • Statistical model, in applied statistics, a parameterized set of probability distributions

    They're all fine. So is animal model. They are all types of stand-in for the real thing. But they are not procedures for prediction. They can be used for making predictions, but that is a separate step. It becomes particularly confusing when you have a stochastic model (a better name than statistical model) and an estimation method (maximum likelihood, Bayesian posterior mean, etc) and people call the whole thing a model. For example, "maximum likelihood model" is used in this Nature paper Reconstructing the early evolution of Fungi using a six-gene phylogeny to refer to a model (in the correct sense) of evolution, and a particular statistical procedure.

    As an alternative I suggest machine.

  • Options

    Re #35: thanks Graham, I see what you mean. I've changed the model definition to

    For our purposes here, a model is any object which provides a systematic procedure for taking some input data and producing a prediction of some output.

    Looking at the rest of the text it looks like it's reasonably OK in its use of "model", so I haven't changed anything else.

  • Options

    Thanks to everyone for all their really helpful review comments that have greatly improved the blog article.

  • Options

    I posted the article!

  • Options
    edited September 2014

    Thanks. I've set myself a hard target of Sep 13 for having a rough version of the actual data exploration blog done; if it's not done by then it's very likely not to be done. Anyway John, hope you have a good break.

  • Options

    How is that "hard target" going?

  • Options

    I haven't been able to concentrate as much as I'd hoped. However, the other thing in my life that had set that hard target has also been pushed back by a week, so there's still a possibility it'll get completed if I can get it done during the next week. I've made a start on the "expository" part of the blog, which has thrown up a couple of "layout" questions I'd like your input on; I've asked them on the new thread for that blog entry.

  • Options
    edited September 2014

    I think Peter referred to this text:

    Joshua S. Bloom and Joseph W. Richards, Data mining and machine learning in time-domain discovery & classification (2011)

    It has very clear descriptions of some ML methods. I've tried to summarise them but I'm not sure which wiki page would be most appropriate to post it on if it's of any use?

    • Kernel density estimation (KDE) classifier: class-wise feature distributions are estimated using a non-parametric kernel smoother.

      Con: difficulties estimating accurate densities in high-dimensional feature spaces (the curse of dimensionality)

    • Naive Bayes classifier: class-wise KDE on one feature at a time, assuming zero covariance between features

      Con: zero covariance is unlikely to be true.

    • Bayesian network classifier: assumes a sparse, graphical conditional dependence structure among features.

    • Gaussian mixture classifier: assumes that the feature distribution follows a multi-variate Gaussian distribution where the mean and covariance of each distribution are estimated from the training data.

    • Random forest classifier: statistically significantly outperforms the Gaussian mixture classifier.

    • Quadratic discriminant analysis (QDA) classifier (or linear discriminant analysis (LDA) classifier if pooled covariance estimates are used): the names refer to the type of boundaries used between classes.

    • Support vector machines (SVMs): find the maximum-margin hyperplane to separate instances of each pair of classes. Kernelisation of an SVM can easily be used to find non-linear class boundaries.

    • K-nearest neighbours (KNN) classifier: predicts the class of each object by voting among its K nearest neighbours in feature space, implicitly estimating the class decision boundaries non-parametrically.

    • Classification trees: perform recursive binary partitioning of the feature space to arrive at a set of pure, disjoint regions.

    • Trees:


      • can capture complicated class boundaries
      • are robust to outliers
      • are immune to irrelevant features
      • easily cope with missing feature values.


      They have a high variance wrt. their training set due to their hierarchical nature. Bagging, boosting and random forests overcome this by building many trees on bootstrapped versions of the training data and averaging their results.

    • Artificial neural networks: non-linear regression models which predict class as a non-linear function of linear combinations of input features.


      • computational difficulty with many local optima
      • lack of interpretability (said by the authors to have severely diminished their popularity among the statistics community, as of 2011).
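To make one entry from the list concrete, here's a minimal sketch of the K-nearest-neighbours classifier (toy data of my own invention, not from the paper): each new point takes the majority-vote class of its K nearest training points in feature space.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    preds = []
    for x in X_new:
        dists = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances
        nearest = y_train[np.argsort(dists)[:k]]        # labels of k nearest
        preds.append(np.bincount(nearest).argmax())     # majority vote
    return np.array(preds)

# Two tiny clusters: class 0 near the origin, class 1 near (1, 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.05, 0.1], [1.0, 0.9]]), k=3))
# -> [0 1]
```

No explicit boundary is ever fitted; the decision boundary is implicit in the vote, which is the "non-parametric" point above.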
  • Options

    Hello Jim

    Each algorithm is quite sophisticated to adapt for learning from a particular dataset; training well takes incredible effort and is something of an art. Usually people just throw some data at an algorithm, and most of the time the results are inferior.

    I will try to setup some as I did for SVR for anomalies data, and you will see it requires serious effort to train


  • Options

    I will try to setup some as I did for SVR for anomalies data, and you will see it requires serious effort to train.

    Sounds good to me. So far I've only got as far as seeing a few of the difficulties. I thought it might be useful to list a few of the different techniques currently in play.

  • Options

    Jim you did the right thing, sadly my world is the "rubber hits the road" world :)

  • Options

    Jim wrote:

    I’ve tried to summarise them but I’m not sure which wiki page would be most appropriate to post it on if it’s of any use?

    How about putting your summary on a new page called Machine learning? I don't see any existing pages that are quite right; I could be missing something. The main thing is to get it up there!

  • Options
    edited September 2014

    Hi John, I didn't start a new page because I didn't want folks who see David's blog article to look for machine learning on the wiki and get a load of incomprehensible Hessian-free techniques - I should have flagged this to David.

    David, would you mind starting off a Machine Learning page with some definition of SVR? I used to be able to recognise a back-propagation algo but apart from that I have no neural net fu even though I had the first Rumelhart big red coffee table book. I can and will write up my notes on Yann LeCun's work.

    Paul has also listed the 4 techniques he's been working on.
