Welcome to #TWIMLcon Shorts - a series where I sit down with some of our awesome Founding Sponsors and talk about their ML/AI journey, current work in the field and what we can expect from them at TWIMLcon: AI Platforms!
First up is Luke Marsden, Founder & CEO of Dotscience. Based in Bristol, UK, Luke joins me to share the Dotscience story and why he is most excited for #TWIMLcon next month! From a stellar breakout session featuring the Dotscience manifesto to live demos at their booth, we can’t wait!
Sam Charrington: [00:00:00] All right everyone, I am on the line with Luke Marsden. Luke is the founder and CEO of Dotscience, a founding sponsor for TWIMLcon: AI Platforms. So Luke, we go back a little bit from your involvement in the Docker space. I remember introducing you at a session at DockerCon quite a few years back, but for those who aren't familiar with your background, who are you?
Luke Marsden: [00:00:51] So hey Sam, and thanks for having me on. My name is Luke Marsden, I'm the founder and CEO of Dotscience and I come from a devops background. My last startup was called ClusterHQ, and we were solving the problem of running stateful containers in Docker. And so I'm a sort of serial entrepreneur based out of the UK. I live in the beautiful city of Bristol in the southwest and very excited to be involved with TWIML.
Sam Charrington: [00:01:28] Awesome. So tell us a little bit about Dotscience and what the company is up to in the AI platform space.
Luke Marsden: [00:01:36] Yeah, sure. So we started Dotscience a couple of years ago. Initially, we were targeting the area of data versioning and devops, but we quickly realized that the tool that we built, which is an open source project called dotmesh, was actually much more relevant and important to the world of AI and machine learning, which has big data versioning and reproducibility problems. So we pivoted to that about a year in, and we've been building an AI platform around that core concept of data versioning.
Sam Charrington: [00:02:13] So tell me a little bit more about that. How are you taking on data versioning? And why is that an important element of the puzzle for folks that are doing AI?
Luke Marsden: [00:02:25] Absolutely. So there are really sort of four main pieces of the puzzle that I believe need to be solved to achieve devops for AI, devops for machine learning, and number one is reproducibility - and that's where the data versioning piece comes in. So what we've seen is that there's a lot of chaos and pain that happens when AI or ML teams start trying to operationalize the models that they're developing. And one of the big pain points is if you can't actually get back to the exact version of the data that you used to train your model, then you can't go back and solve problems with it. You can't fix bugs in the model or really reliably understand exactly where that model came from.
So that's kind of that fundamental problem of, like, which version of the data did I train this model on, and that's what we solve with Dotscience. Every time you train a model in Dotscience, you are automatically versioning all of the dependent data sets that that model training happens on. And by using copy-on-write technology, which is a file system technology in dotmesh, which is part of the Dotscience platform, it does that very efficiently, using no more disk space than is required to achieve reproducibility.
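To make the idea of versioning the data behind every training run concrete, here is a minimal Python sketch. It is not the Dotscience or dotmesh API, and it does not use file-system copy-on-write; it swaps in content-addressed storage (one stored blob per unique file hash) to get a similar effect, where unchanged files take no extra disk space. The names `snapshot`, `demo`, and the `run_record` fields are assumptions made up for this illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def snapshot(data_dir: Path, store_dir: Path) -> str:
    """Store each file once under its SHA-256 and return a version ID for the data set.

    Hypothetical helper: store_dir must live outside data_dir.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        content = path.read_bytes()
        digest = hashlib.sha256(content).hexdigest()
        blob = store_dir / digest
        if not blob.exists():  # unchanged files are never stored twice
            blob.write_bytes(content)
        manifest[path.relative_to(data_dir).as_posix()] = digest
    # The version ID is the hash of the manifest (file name -> content hash).
    manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
    version = hashlib.sha256(manifest_bytes).hexdigest()
    (store_dir / f"{version}.manifest.json").write_bytes(manifest_bytes)
    return version


def demo() -> dict:
    """Snapshot a tiny data set twice and count how many blobs were stored."""
    with tempfile.TemporaryDirectory() as tmp:
        data, store = Path(tmp) / "data", Path(tmp) / "store"
        data.mkdir()
        (data / "train.csv").write_text("x,y\n1,2\n")
        (data / "test.csv").write_text("x,y\n3,4\n")
        v1 = snapshot(data, store)
        # Recording the data version next to the model answers
        # "which version of the data did I train this model on?"
        run_record = {"model": "model-a", "data_version": v1}
        (data / "train.csv").write_text("x,y\n1,2\n5,6\n")  # only one file changes
        v2 = snapshot(data, store)
        blobs = [p for p in store.iterdir() if not p.name.endswith(".manifest.json")]
        return {"v1": v1, "v2": v2, "blobs": len(blobs), "run": run_record}


if __name__ == "__main__":
    print(demo())
```

In this sketch the second snapshot adds only one new blob, because `test.csv` did not change between versions.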
Sam Charrington: [00:03:52] Awesome. So tell me why are you excited about TWIMLcon: AI Platforms?
Luke Marsden: [00:03:59] TWIMLcon looks to be an awesome event. We were actually planning on hosting our own event around the same time in San Francisco to promote Dotscience, but TWIML was such a good fit for what we're trying to do, and the themes and the topics that are being discussed in the space, that we decided to join forces with you guys and become a Founding Sponsor rather than running our own thing.
So yeah, really, really excited and looking forward to it.
Sam Charrington: [00:04:34] That's fantastic and we are super appreciative to have you on board as a Founding Sponsor, it is great to have your support in that way. When folks come to your breakout session at TWIMLcon, tell us a little bit about what you'll be covering there, who will be presenting, what can attendees expect to learn from the breakout session?
Luke Marsden: [00:04:57] Yes, so the session will be run by my colleague Nick, who's our principal data scientist, and the basic premise of the talk really touches on some of the things I mentioned earlier. There's a lot of chaos and pain in trying to operationalize AI, and we have this manifesto of things that we believe are needed to move on from the sort of "no-process process" that is the default. So when you start an AI or machine learning project and you have maybe a small number of data scientists or machine learning engineers doing that work, they'll invent a process, right? Any technical group that's doing technical work will make up a process as they go based on the tools that they're familiar with and they'll do their best.
But the point of the talk is that the "no-process process" gets your first model into production when your team is small, but that's really where the problems begin and (Laughter) you end up with this kind of mess of models and data sets and deployments and hyperparameters and metrics and all these different things flying around, because machine learning is fundamentally more complicated than software development or software engineering. And so, by just sort of doing things in an ad-hoc way, you get yourself into this sort of mess quite quickly, and this is something we've seen across hundreds of companies that we've spoken to in the industry.
And so basically what we're proposing is a manifesto: that you should make your whole machine learning process, the whole process of building, training, deploying, and monitoring machine learning models, reproducible, accountable, collaborative, and continuous.
And so what I mean by reproducible is that somebody else should be able to come and reproduce the model that I trained now, like 9 or 12 months later, without me still needing to be there, without me needing to have kept meticulous manual documentation. Somebody else should be able to go and rerun that model training against the same version of the data with the same version of TensorFlow, with the same code, with the same hyperparameters, and get the same accuracy score to within a few percent.
If your development environment isn't reproducible, then you won't be able to do that, but we believe that that is key to achieving devops for ML.
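The reproducibility requirement described above can be sketched as a run record that someone else can replay later. This is an illustrative Python example under stated assumptions, not how Dotscience implements it: the toy model, the record fields, and the function names `data_version`, `train`, `run_and_record`, and `reproduce` are all hypothetical. A real record would also pin the code commit and library versions (for example the TensorFlow version), which this sketch only hints at by storing the Python version.

```python
import hashlib
import json
import platform
import random


def data_version(rows):
    """Hash the training rows so the exact data version can be checked later."""
    return hashlib.sha256(json.dumps(rows).encode()).hexdigest()


def train(rows, hyperparams):
    """Toy model: fit y = w * x by seeded SGD; returns (w, mean squared error)."""
    rng = random.Random(hyperparams["seed"])  # all randomness comes from the recorded seed
    w = rng.uniform(-1.0, 1.0)
    for _ in range(hyperparams["epochs"]):
        order = list(rows)
        rng.shuffle(order)
        for x, y in order:
            w -= hyperparams["lr"] * 2 * (w * x - y) * x
    mse = sum((w * x - y) ** 2 for x, y in rows) / len(rows)
    return w, mse


def run_and_record(rows, hyperparams):
    """Train once and write down what is needed to rerun the training."""
    _, mse = train(rows, hyperparams)
    return {
        "data_version": data_version(rows),
        "hyperparams": dict(hyperparams),
        "python_version": platform.python_version(),
        "mse": mse,
    }


def reproduce(record, rows, tolerance=1e-9):
    """Rerun training from a record and check the metric matches within tolerance."""
    if data_version(rows) != record["data_version"]:
        raise ValueError("data differs from the version the model was trained on")
    _, mse = train(rows, record["hyperparams"])
    return abs(mse - record["mse"]) <= tolerance


if __name__ == "__main__":
    rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
    record = run_and_record(rows, {"seed": 7, "epochs": 20, "lr": 0.01})
    print(reproduce(record, rows))
```

The tolerance here is tight because the toy training is fully deterministic; for real training runs, a looser tolerance such as the "within a few percent" mentioned above would be the more realistic check.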
So anyway, that's kind of a snapshot of some of the things we'll be talking about in the session. So yeah, please, please come along.
Sam Charrington: [00:08:00] Awesome. You'll also be present in TWIMLcon's Community Hall. What can attendees expect to see at the company's booth? Will they be able to get hands-on?
Luke Marsden: [00:08:15] Absolutely, so we'll have live demos at the booth. You can see the full end-to-end platform, and our engineers, as I speak today in the early part of September, are busily working on the latest features that we're going to have ready in time for the conference, in true startup conference-driven development mode. (Laughter)
So, we will have the deploy to production and statistical monitoring pieces ready in time for the conference. So the first time that you can come and see those pieces of the product and get hands-on with the product will probably be at TWIML, so please come and check it out.
Sam Charrington: [00:09:00] Fantastic. Luke, thanks so much for chatting with me about what you're up to and what you'll be showing at the event, we are super excited to have you on board with us for TWIMLcon: AI Platforms.
Luke Marsden: [00:09:10] Awesome. Thank you Sam.
TWIMLcon: AI Platforms will be held on October 1st and 2nd at the Mission Bay Conference Center in San Francisco.