DataFramed

#4 How Data Science is Revolutionizing the Trucking Industry

ABOUT THIS EPISODE

The trucking industry is being revolutionized by data science. But how? Hugo speaks with Ben Skrainka, a data scientist at Convoy, a company that provides trucking services for shippers and carriers, powered by technology to drive reliability, transparency, efficiency and insights. We'll dive into how data science can help to achieve such a trucking revolution, and how this will impact all of us, from truckers to businesses and consumers alike. Along the way, we'll delve into Ben's thoughts on best practices in data science, how the field is evolving and how we can all help to shape the future of this emerging discipline.

In this episode of DataFramed, a DataCamp podcast, we'll be looking at how data science is being used to revolutionize the trucking industry. I'll be speaking with Ben Skrainka, a data scientist at Convoy, a company that provides trucking services for shippers and carriers, powered by technology to drive reliability, transparency, efficiency and insights. We'll dive into how data science can help to achieve such a trucking revolution and how this will impact all of us, from truckers to businesses and consumers alike. Along the way, we'll delve into Ben's thoughts on best practices in data science, how the field is evolving and how we can all help to shape the future of this emerging discipline. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed. Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems data science can solve. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and you can also follow DataCamp at @DataCamp. You can find all of our episodes and show notes at datacamp.com/community/podcast. Ben, welcome to DataFramed. Thank you, it's great to be here, and I'm looking forward to speaking with you today about Convoy and data science. I couldn't agree more. It's great to have you here, and in our ongoing exploration on DataFramed of what data science is and what it can be, I'm pretty darn excited to be speaking to you today about your work at Convoy and the role that data science can play in revolutionizing the trucking industry, arguably one of the most impactful industries in North America. But first I want to talk about you. Okay, thanks. People always ask me what data scientists actually do, Ben. What do your colleagues think that you do? Yeah, that's a great question. I think that, you know, the term data scientist is really vague and it means different things to different people, and I come to data science from both a natural science and a...

...social science background. I also took a detour through Silicon Valley before I got my PhD in economics. So for me, I kind of spend half my day in meetings talking about the science of how to run our platform, or how we're developing our approach to data science here, or specific questions, say experimental design. I spend about half my time mentoring other data scientists, so helping a junior data scientist work on an experimental design for a campaign or, say, build a survival model to understand customer retention. And then I spend another half of my time working as an individual contributor, thinking about the aspects that drive our platform and just trying to have impact by fixing the thing that's currently on fire. So think about things like auctions or matching or pricing and those kinds of issues. Anyone who can do math will probably note that that adds up to three halves, but, you know, I work at a startup, so things don't add up to one. I couldn't agree more. At DataCamp, I feel like sometimes I'm operating in the form of an infinite sum, actually. And as your background's in physics, you've told me, you know, there are only three numbers to physicists, right? Right, there's zero, one and infinity, and everything else is a matter of scaling. That's the old joke, exactly. So how did you get into data science? You know, that's a great question, and I think I've always been working with software and data and math and statistics and models. And, you know, after DJ Patil coined the term and it went viral, then someone in the marketing department rebranded me as a data scientist. But even when I was an undergraduate in college I was writing software to understand scientific questions. So the first data science-y thing I did was I wrote a program to understand crater relaxation for Bill McKinnon, who's an...

...amazing planetary scientist at Washington University, and the band Frankie Goes to Hollywood was big then, and since we were dealing with crater relaxation, I called my program Frankie. So that may date me. And then more recently, after I did my PhD, I was at the University of Chicago and was recruited to work at Amazon, and that was like two thousand and twelve. So at that point I'd say I had really made the transition to taking stuff from academia and applying it in industry. So now you're a data scientist at Convoy. Right, tell me a bit about what Convoy does, what its mission is. Yeah, this is a great example of how it's much easier for software to come into an established industry than for an established industry to bring software in itself. And so the freight industry has been around for a long time. In the US it's a super huge market, it's like eight hundred billion dollars, and there's this problem, which is most carriers are heavily fragmented. Carriers are trucking companies; they have one or more trucks, and the carriers are heavily fragmented: the ninety-fifth percentile carrier has like two or three trucks. And so the way the shippers connect with them is through these brokers, and the brokers operate using legacy technology, so phones and fax machines. And what we're doing at Convoy is we're helping pioneer this new approach called digital freight, where we're using technology, and in particular data science, to make this whole process better for everyone, because it's really important to match the right carrier with the right shipper's load, because then everything works better. So we need to price it, match it and automate the whole process, and that's just going to revolutionize the cost structure of the industry. And for all of you out there: you're all listening and you're also consumers. Most things you buy take, you know, eight or ten truck trips to get to you, and so if we can lower the price of freight, the cost of the things you buy hopefully should become much more competitive. So what does this look...

...like on the ground for either? So you said, carry is the drives. So carriers are. It's a kind of complicated definition because a lot of them are like owner operators. So kind of the American dream. They could be. For instance, here in the Pacific northwest, could be like a Russian immigrant who comes over here with nothing and then works his way up into becoming a owner operator with one truck and then, as he becomes more successful, he could build out his fleet, and that's something convoys trying to help owner operators do. or it could be a more established midsize truck and company. So typically they have one or more trucks. It's hard for an individual to scale the more than about twenty trucks because then logistics becomes complicated and you have, you know, HR and things like that to deal with, for sure. So what does convoy look like on the ground. For carriers, do they use Appsol? Yeah, that's a great question. So the main way they interact with us is through a through the convoy APP, and we have an onboarding process that makes it really simple for them to give us their packet of information, so things like insurance and licensing information, and then we can, you know, activate their account and at that point, you know, they will see offers that are served to them based on the kind of loads they tell us ay like. So they might say, Hey, I like to operate on the I five corridor, and then they can accept those loads in an APP and never have to talk to a human. So it's super efficient. And they can also bid on APP on loads as well. So if they don't like our price, you know, or they're competing with other carriers for load, they can bid. The price can go up or down depending on market conditions. And then the other thing we do that's great for them is, say they take a load from San Francisco to Los Angeles. Well, if we see they do that, we then say hey, we've got a load from Los Angeles, you know, coming back to where you were, are going somewhere else, and so we're aware of that, to just try to keep them moving and earning money so that they're empty lefts...

...drop off, because it turns out that trucks are empty about forty percent of the time, which is bad for the environment and bad for the truckers. Yeah, absolutely. I imagine there are all types of traveling salesman problems in trying to figure this out: you don't necessarily want to have a driver drive an empty truck from Seattle to the Bay Area in order to then transport stuff from there elsewhere. You'd like to figure out how that trip can be optimized with respect to how much they're carrying. Yeah, absolutely. And so as we get more liquidity on the platform, it becomes easier and easier to be able to solve these kinds of problems and basically keep carriers in a state of almost constant motion. That's really cool. So this is one of the reasons I find this so interesting: when we think about the role data science plays in modern industry, we think of tech a lot of the time, which is, you know, an industry that was born with this rise in access to data, whereas you're talking about revolutionizing an industry that predates all of this technological infrastructure. Absolutely, and if we're successful in what we're trying to do, we are going to change the cost structure for a major industry that affects many Americans. You know, one of the amazing facts I've learned at Convoy is that the most common job in something like forty-five-ish states is truck driver. So this affects a lot of people in terms of their employment, and it also affects everyone who is a consumer. This can have a big impact on many people's lives in a positive way, in particular for the truck drivers themselves. We're making it easier for them to run their own business, grow that business and, you know, just have much less friction in how they run that business. For sure. So how does data science then play a pivotal role in Convoy's mission? So data science is central to what we do, and in fact, from the first day we opened, we've...

...always been automated, and so we have to solve a wealth of fascinating problems with lots of economics. So it's very important to be able to predict the price, because we need to price correctly to shippers, since often we enter into long-term contracts with them, and we also have to price correctly for the carriers. Then we need to make sure we solve the matching problem to match the right carrier to the right load, because that means better outcomes for everyone. So, for instance, "right" could mean less deadhead, which is the time you drive empty to get to the start of the load, and also the destination and endpoints could matter. We worry about auctions and the whole price acceptance mechanism for the carriers themselves, as well as many other things related to the carrier life cycle and making sure that everything's going well once the carrier's picked up a load. So there are a wealth of problems like that.
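
To make that matching problem concrete, here is a minimal sketch of the kind of carrier-to-load assignment Ben describes, framed as minimizing total deadhead miles. This is purely illustrative: the carriers, loads and mileages are invented, and this is not Convoy's actual system or data.

```python
# A minimal sketch of the matching problem: assign each carrier to a load
# so that total deadhead (empty miles driven to reach the load's origin)
# is minimized. The cost matrix below is made up for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

carriers = ["carrier_a", "carrier_b", "carrier_c"]
loads = ["SEA->PDX", "SFO->LAX", "LAX->PHX"]

# deadhead_miles[i][j] = empty miles carrier i drives to reach load j's origin
deadhead_miles = np.array([
    [ 12, 810, 960],
    [680,  25, 390],
    [950, 380,  15],
])

rows, cols = linear_sum_assignment(deadhead_miles)  # Hungarian algorithm
for i, j in zip(rows, cols):
    print(f"{carriers[i]} -> {loads[j]} ({deadhead_miles[i, j]} deadhead miles)")
```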

And it strikes me that, as a business, you might come up against a chicken-and-egg problem, in the sense that to convince shippers to come on board with you, you need to have carriers on board, and to convince carriers, you need to have shippers. Yeah, that's something that's really hard for any platform. There's a great post by Simon Rothman, who's one of our investors from the Series A round, and he says, basically, what you have to do is start two companies at the same time. So you've got to bootstrap a company on the supply side and the demand side at the same time, and it makes it really tricky, because we've got to keep this in balance. Fortunately we have an amazing sales team, and they're really good at generating quality demand, and then we quickly build supply to try to keep everything in balance, and that's something we track and that's something any platform worries about. They think a lot about liquidity and maintaining balance. It's like, you know, if you run a dating site, and let's assume it's heterosexual, if there are no women on there, it's a bad dating experience, and so you need to have equal numbers of men and women. The reason this is at the front of my mind is that we had a similar challenge at DataCamp in the early days: getting students while getting instructors as well, because instructors want an audience, and students want the best instructors, right? Yeah, absolutely. And then, on the other hand, there are also network effects. So once it gets going, these businesses tend to grow exponentially, so it's really exciting. It means that if you do this right, you'll find that you're not sleeping. Absolutely. So you told me one of the most important things is a really strong sales team. How is the data science team integrated into the company with respect to, for example, the sales team? Yeah, I think one of the big things we can do is help them understand pricing, things like how to bid on loads and what loads you should bid on, and those are some ways that we can work very closely with them. Pricing is an incredibly complex thing and there are all kinds of incentives around it. So let's take a step back. We've been discussing, kind of circling around, the role of data science. What are the specific types of data science questions that you need to answer in your job? So I think one of the things that's really good about Convoy and our culture is we're really data-driven, and so we do a huge amount of experimentation, just like everyone else in industry, or certainly the top players, and so the way we answer questions is often through experimentation, if that's the only way to solve it. We've invested heavily in building a very good experimentation framework that enables us to iterate quickly on experiments. There are different approaches to running A/B tests, like Bayesian or frequentist, and, you know, the Bayesian or sequential analysis will get you an answer much more quickly. They also make it easier to discuss results with product managers. Let's just step back a bit, and maybe you can give us an example of an A/B test...

...that you'd perform. Great. So, you know, the kind of thing that you would typically A/B test is: does some new UX flow work better? So, let's say one of the things we care about is that when a driver completes a trip, they automatically upload their paperwork, something called a BOL, a bill of lading. If we rolled out, say, a new process to improve that and make it easier, we could run an experiment where we keep half the carriers on the old technology and put half on the new technology, and then after some time we can say with some probability whether or not the new process works better. That's a great example, and a great description of A/B testing as well. I'm going to probe a bit more. You mentioned that Bayesian methods converge more quickly, or give you results more quickly, than frequentist methods. I know I'm putting you on the spot here, but would you mind, for the lay people out there, just giving a brief description of the difference between frequentist statistics, which people may be more familiar with, and the Bayesian methods that you're discussing in this case? Certainly. So, you know, the first real approach to statistics was the Bayesian method, which was developed by Thomas Bayes, who was actually a vicar, and his idea works a lot the way your intuition works, which is: you start out with some prior set of beliefs about something, like "my new process is better, say it's going to cause a ten percent lift," and then over time, as you observe data, you update those beliefs, and so this Bayesian updating will converge to what we call the posterior, and that is, you know, the distribution that we expect the lift to have. Then there's the frequentist view. And before I go into frequentism: the Bayesian methods were very hard to compute until, you know, maybe two decades ago. We didn't really have the...

...computational resources. Since then there have been huge improvements, and it's much easier to compute these models; before then, you could only really solve special cases. So people like R. A. Fisher in particular were very critical, I think, of the Bayesian approach, and so they developed the frequentist approach, you know, in the early nineteen hundreds, if I'm correct, and there the idea is that there is some true value of the parameter, and if I sample enough data, as I get more and more data, that's going to converge to the truth. And so frequentists tend to talk about p-values and confidence intervals, and that's the traditional hypothesis testing you're used to. So in the frequentist world, you'd have to go to a marketing manager and say, "conditional on the null hypothesis being true, the probability that I observed an effect as big as the one we saw, or bigger, is, you know, five point seven percent," and then you'd have this argument, because the traditional view on significance is that you would use a significance level of five percent to say that the result was significant. So in this case, if you are a good person, you would not be able to reject the null hypothesis, but your marketing manager is probably going to say "five point seven percent is really close to five percent, let's just use a ten percent significance level," and you're in a world of hurt. Whereas with the Bayesian method, you can go to the product manager and say there's a, you know, ninety-four point three percent chance, or ninety-eight percent chance, that variant A is better than variant B, and so it's a much easier conversation to have. Fantastic. And in the example you're talking about, variant A and variant B, the parameter we're looking at would be the proportion of people who successfully upload all their paperwork, correct, as a function of whatever the UX looks like? Right, so it's whatever you're testing: does my new workflow have a higher click-through rate, a higher checkout rate, does it lower churn, whatever you're interested in.
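
For readers who want to see the two approaches side by side, here is a minimal sketch of an A/B comparison like the bill-of-lading example, with a frequentist test and a Beta-Binomial Bayesian analysis on the same counts. The numbers are invented and the flat priors are an assumption for illustration, not anything Convoy uses.

```python
# A minimal Bayesian vs frequentist A/B comparison for a conversion-style
# metric (e.g. "did the driver upload the bill of lading?").
# All counts are invented for illustration.
import numpy as np
from scipy import stats

uploads_a, n_a = 420, 1000   # old flow: 42.0% upload rate
uploads_b, n_b = 455, 1000   # new flow: 45.5% upload rate

# Frequentist: test of equal proportions via a chi-squared contingency test.
table = [[uploads_a, n_a - uploads_a], [uploads_b, n_b - uploads_b]]
_, p_value, _, _ = stats.chi2_contingency(table)
print(f"frequentist p-value: {p_value:.3f}")

# Bayesian: Beta(1, 1) priors updated with observed successes and failures,
# then Monte Carlo draws from each posterior to estimate P(B beats A).
rng = np.random.default_rng(42)
post_a = rng.beta(1 + uploads_a, 1 + n_a - uploads_a, size=100_000)
post_b = rng.beta(1 + uploads_b, 1 + n_b - uploads_b, size=100_000)
print(f"P(variant B beats variant A) = {(post_b > post_a).mean():.3f}")
```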

So yeah, I think the Bayesian method makes it much easier to have conversations with product managers. In addition, in our simulation studies we find it converges much more quickly for our business. You know, your mileage may vary on your business, and if you're in Europe, your kilometers may vary. Yeah, I think I recall Dave Robinson actually wrote a number of posts for Stack Overflow about Bayesian A/B testing, and showed that for their experiments it didn't necessarily converge more quickly. I need to check that, and we'll put that in the show notes as well. Yeah, there were also some great blog posts by Evan Miller on the subject. Let's now dive into a segment called Data Science Buzzwords, with DataCamp curriculum lead Spencer Boucher. What's up, Spencer? Hey, Hugo. What data science buzzword are you going to demystify for us today? Today let's talk about big data. We use that term a lot, but a lot of people find themselves hazy on what exactly it is. Big data means different things to different people, and asking "how big is big data?" is a bit like asking "how long is a piece of string?" So where did the term come from? A paper published in IEEE by Michael Cox and David Ellsworth in nineteen ninety-seven is widely considered to be one of the first introductions of the term. In that paper, the authors discussed the problem of visualizing data that doesn't fit into memory. This would actually continue to be a major theme of analytics through the twenty-first century, all the way up to today. But maybe the biggest milestone in the history of big data came back in two thousand and one, when a white paper by Gartner analyst Doug Laney proposed three defining qualities of big data that have become very well known in the industry: volume, velocity and variety. Can you begin by telling me about volume, which seems like an intuitive quality to think about, right?...

So, in general, when we talk about big data, what we're talking about is data that doesn't fit onto your laptop, or even usually onto one machine. Technologies like Hadoop have been developed primarily to deal with exactly this issue, leveraging large networks of computers to operate on large chunks of data simultaneously. For certain algorithms at least, that allows data scientists to scale to any amount of data simply by adding more hardware to the problem. So, for example, Facebook has a cluster with over one hundred and twelve petabytes of raw storage. Just to give you a sense of scale, fifty petabytes would roughly correspond to everything ever written down in all of mankind's recorded history. Wow. So what about velocity, then? Yeah, so velocity is another big part of what makes big data tricky. In a world where every single click can generate innumerable data points, streaming tools have emerged to crunch numbers in real time, while things are still changing minute to minute or even second to second. Streaming technologies like Spark and Kafka address this tricky problem of analyzing data that arrives and changes extremely quickly, opening the door to really interesting anomaly detection algorithms that operate in real time on a very large scale. For example, these days streaming engines are capable of handling over sixty million records per second. So we have volume, velocity, and the final quality was variety. Yeah, exactly, variety. It can refer to the explosion of data storage formats in recent decades, but on a more fundamental level, it means the vast strides we're making in expanding the kinds of data that we're able to cheaply and easily collect. It's not unusual for one company, or even one data analysis these days, to involve text, click data, audio, video and all other sorts of unstructured data. So, Spencer, how would I know if my data is big data? Well, in the end, Hugo, that's not what...

...really matters. What matters is that you're applying the right tools and techniques to your data, appropriate to whatever its volume, velocity and variety are. There you have it, folks: data science buzzword demystification in action. Thanks, Spencer. Yep, anytime, Hugo. Now back to our interview with Ben Skrainka. So, experimental design: that's a really interesting approach, because I don't think a lot of people would expect that experimental design and these types of methods would play such a huge role in reinventing the trucking industry using data science. What other types of techniques and methods are you guys interested in? Yeah, so just before we go on, to close out A/B testing: there's a lot of wisdom that people have about the trucking industry. You know, our company is half tech startup and half trucking industry veterans, and they have a lot of hard-won knowledge and intuition, but it's not always correct or precise, and so by performing experiments we can make things much more concrete. Interesting. Is there something political about that as well, and social? In this instance, a lot of these people have been around for a significant amount of time, have certain amounts of power and hard-won knowledge in some ways, and is there a view that tech startups can come in and, in inverted commas, "disrupt" that, and does there need to be social behavior reflecting that? Yeah, I mean, I can only speak within the confines of the Convoy culture, and we have a great, really team-oriented culture. It feels like when I played on good hockey teams. And, you know, I think that both sides are really appreciative of what the other brings to the table, and you might even have two industry veterans who don't agree about something, and so at Convoy we try not to have disagreements that can't be resolved. Let me rephrase that. So if a disagreement can't be resolved through some kind of...

...intellectual argument with theory or facts, then we run an experiment, instead of sitting around and having an argument that can't be resolved. So what else? What are the types of techniques and methodologies you guys are interested in using? Yeah, so, I mean, we're very much a practical, applied data science shop. We're trying to solve concrete business problems in, you know, short amounts of time, and so we use the standard toolkit you would expect us to use. We're very agnostic about tools, so people tend to use the best tool for the job. In terms of technologies, that could be R or Python; they both have strengths and weaknesses, and I think it's good to know both. And then, in terms of specific approaches, sometimes machine learning is best, particularly if we need to predict something, like predict whether or not someone's going to upload a BOL or, say, be a good carrier; then you might build a logistic regression or, you know, some kind of boosted classifier. But there are other times where you need to understand if A causes B. Maybe we weren't able to run an experiment, and then we would be back in the world of applied statistics and do some kind of regression analysis. Kind of my first win at Convoy was before I started: they had released a feature without an A/B test, and they wanted to know whether or not this new feature helped, and I was able to use the CausalImpact package that Google developed, using Bayesian structural time series, to show that the new feature had a beneficial impact. Fantastic, and that's on data already collected, right? Right, so we have existing data, and then you have to go and try to make sure that everything is as good as randomly assigned, hopefully. So, you know, one of the key features of an experiment, right, is you need random assignment to treatment, and you also need to satisfy some other things, like that your assignment is individualistic and probabilistic and unconfounded.
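
As an aside on the CausalImpact approach Ben mentioned a moment ago: the package itself is an R library built on Bayesian structural time series, but the core idea can be sketched briefly. The sketch below substitutes plain least squares for the BSTS model and runs on synthetic data, so it shows the flavor of the method rather than the real implementation.

```python
# Deliberately simplified illustration of the CausalImpact idea: learn the
# pre-period relationship between the treated metric and a control series,
# predict the post-period counterfactual, and treat the gap as the effect.
# (The real package uses Bayesian structural time series, not plain OLS,
# and all the numbers here are synthetic.)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_pre, n_post = 80, 20
control = 100 + np.cumsum(rng.normal(0, 1, n_pre + n_post))
treated = 0.9 * control + rng.normal(0, 1, n_pre + n_post)
treated[n_pre:] += 5.0  # inject a true lift of +5 after "launch"

X = sm.add_constant(control)
model = sm.OLS(treated[:n_pre], X[:n_pre]).fit()  # fit on pre-period only
counterfactual = model.predict(X[n_pre:])         # expectation with no launch
effect = treated[n_pre:] - counterfactual
print(f"estimated average effect: {effect.mean():.2f} (true effect: 5.0)")
```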

So these are technical terms, and if any of these fail, then you're back in the world of observational data, and then you're going to use applied statistical methods or econometrics to try to create something that's as good as randomly assigned, so that you can then make some kind of causal statement about whether or not A causes B. Could you remind me what econometrics is? Oh, econometrics is the set of statistical tools that economists have developed for dealing with economic problems, and for the business world those tools are super helpful, because most of our problems are economic in nature. So a classic example would be dealing with something like sample selection and other forms of what is called endogeneity, where you have outcomes that are codetermined within the model. The classic example is if you're trying to understand whether or not increasing the size of your police force will reduce crime. Well, you could think of crime as probably a function of the amount of police you have, but police itself is also a function of the level of crime, and so that's an example of simultaneity. Sample selection, once you start to look for it, you see it everywhere. That would be like, you know, I try to run an experiment to see if small class size improves reading comprehension, but all the parents of kids who are posh insist that their kids are in the small class, and so now the kids who are in the smaller classes are more clever, and you've got selection bias. And so econometrics has created a bunch of tools to deal with these types of challenges. In particular, there's one set of tools that are well known to economists but not to data scientists outside economics, and that's panel data, and these tools are really good for dealing with what economists would call individual heterogeneity. So if I'm looking at the behavior of carriers or shipments, these have, you know, individual quirks that I can't observe, and if I can observe, say, a carrier over time, panel data gives me great methods to remove these unobserved individual effects that could confound my estimates.
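
A toy sketch of that panel data idea: observing each carrier repeatedly lets you demean within carrier (the "within" or fixed-effects transformation), sweeping out time-invariant unobserved quirks before estimating the effect you care about. Everything below is synthetic and just illustrates the bias that fixed effects removes.

```python
# Fixed-effects ("within") panel regression on synthetic data: the naive
# pooled OLS slope is biased by an unobserved per-carrier quirk, while
# demeaning by carrier sweeps that quirk out and recovers the true slope.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_carriers, n_weeks = 50, 30
carrier = np.repeat(np.arange(n_carriers), n_weeks)
quirk = rng.normal(0, 2, n_carriers)[carrier]      # unobserved carrier effect
price = 1.0 + 0.5 * quirk + rng.normal(0, 1, carrier.size)   # confounded regressor
loads = 2.0 * price + quirk + rng.normal(0, 1, carrier.size)  # true slope = 2

df = pd.DataFrame({"carrier": carrier, "price": price, "loads": loads})
within = df.groupby("carrier").transform(lambda s: s - s.mean())

fe = sm.OLS(within["loads"], within["price"]).fit()
ols = sm.OLS(df["loads"], sm.add_constant(df["price"])).fit()
print(f"naive OLS slope: {ols.params['price']:.2f} (biased by carrier quirks)")
print(f"fixed-effects slope: {fe.params['price']:.2f} (true value: 2.0)")
```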

That's very interesting, and it seems to me, correct me if I'm wrong, that what you're saying is that there's a whole bunch of tools that have been developed by some very smart people in econometrics that could be utilized in data science but haven't seen the light of day yet in this world. Right. And, you know, hey, I love machine learning and it's great, but there are also problems where it doesn't work, and I think people have become overfocused on machine learning to the point of overlooking econometric methods that are often very useful and can solve problems that can't be solved with machine learning. I remember talking with someone at Uber a while before I joined Convoy, and he actually said they had encountered problems that they could not solve with machine learning; they could only solve them by building a structural econometric model, which is a very complicated process. It often takes about a year or more to build one of these models and get it to work well, because you try to model the whole behavioral process and utility function, but when you're done you have a very rich and powerful model where you can make good predictions about counterfactual outcomes. Cool, and that was at Uber, you said? Yeah, it actually came up during my interview with them, before I went to Convoy. It also sounds, given we've discussed machine learning, econometrics, experimental design and Bayesian methods for A/B testing, like you have a lot of geographical data, geospatial data, and time series of geospatial data. Does this play a role in any of your work? Yeah, I think that data is really important and we're just beginning to unlock what it can do, but we use it in the app in a lot of ways to make the carrier experience better. So, for instance, when a carrier shows up...

...to pick up a load, we can automatically check them in based on geospatial data. Some loading facilities are very poor, and if they keep the truck waiting too long they have to pay something called detention, and so we can just start auto-paying detention when the carrier is eligible for it, instead of making them go through some laborious documentation process, which is not dissimilar from trying to file an insurance claim in the US. Something that sprang to mind when you said you have all this data whose potential perhaps hasn't been fully explored yet: it seems like there could be potential for all types of social research with respect to the data you're uncovering as well. Yeah, I think that there are a lot of really interesting questions we can answer about matching and platforms and auctions that I would love to get into more deeply. I'm sure there are many academic papers that could be written on the subject. What data science projects in particular at Convoy have you been involved in that you consider the most impactful or telling about society? That's a great question. So, I've been at Convoy a little over a year, and we're a bit over two years old, and I've primarily focused on pricing and experimentation. I think one of the most interesting experiments that we ran was shortly after I started: we ran an experiment where we gave preferential access to loads to higher-quality carriers, and quality went down, and everyone was shocked. Like, wait, we're giving the high-quality people early access to work and quality is going down? What's going on? This makes no sense. And, you know, a bad manager would say, "hey, you data scientists, you're stupid." Fortunately, we have a really good data science manager here who's built a...

...great data culture, and he let us dig into it, and so I started to think about the matching literature in economics, and my hypothesis was that because we had restricted the pool of carriers to a smaller pool, even though they were higher quality, the match quality on the job went down. We were able to verify that, and then we did some regression analysis to show that match quality had a causal impact on quality, and so that was a really exciting discovery, because I think it showed how important matching is on our platform. That's incredible. So, to parse that just for myself: giving high-quality carriers early access to loads meant that the quality of the matches to shippers went down in some sense, and that caused a reduction in quality. Yeah, the quality of how the work was carried out, because there was a smaller pool of eligible carriers, and even though they were better, the fact that there were fewer people to potentially match with the work trumped the fact that they were higher quality. I'm reminded of, I think it's a Brazilian ant of some sort. I'm not saying this is a direct analog, but let's say they have thirty percent of ants in each colony that do nothing, right? If you remove that thirty percent and come back a day later, there's another thirty percent of those ants that do nothing. I'm not saying that there are shippers or carriers that do nothing, but there is some sort of stabilizing force happening there. Well, it's super interesting that it actually went the other way, that something you would intuitively think would result in better quality resulted in worse. Yeah, and that's a great example of why it's so important to experiment. And when we run an experiment here, we've started this nice tradition where people vote on which outcome they think will win. So it gets...

...the whole company involved in experiments, and typically the UX team gives the winners donuts or cool stickers or something like that. So we've had many experiments like that where what you expect to happen doesn't happen, because it's very easy to fall in love with some feature that you think is amazing, and the reality is that it gets ever harder to find something that's going to move the platform forward. In a sense, with these types of experiments you're running some sort of laboratory, right? Yeah, we are focused on our business questions. If this is probing too much into company strategy or private material, just let me know, but I'm just wondering how you guys measure quality of carriers, or quality of a delivery, or anything like that. There's an industry-wide problem in trucking, which is that ten percent of the time, roughly, when a carrier's committed to take a load, they just no-show, or they tell you at the last minute they're not going to take it, and usually the excuse is "my truck broke down," and trucks don't break down ten percent of the time. What it means is that someone offered them a higher-paying load. And there are some carriers who do this all the time, and there are some carriers, you know, I've looked at their data, where they've done a hundred, three hundred trips and they basically never fall off. And so fall-off is one of the key components we use to measure quality, because it's super expensive for us when it happens, because we're committed to providing a really high-quality shipping experience for the shippers, and so we have to then go find another truck who will cover the load at the last minute, and that's really expensive. Makes sense. Is there any consideration with respect to the advent of self-driving trucks within your company at all? Yeah, I should just say first that there are some other things that we think about in terms of quality, like, you know, the on-time percentage of the driver. Getting them to use the app is really important, because that allows us to drive cost down, and compliance...

...and safety. So all those are really important things, but in the specific experiment I mentioned, fall-off was the main thing I worried about. We do find that if we can get carriers to start using the app, then all kinds of good things happen and are possible. And so another thing that we do as data scientists is a very economics-y thing, which is think about how to structure the incentives on the platform to get the behavior we want. So, for example, if a carrier uses the app, they get quick pay, and that means we pay them the same day. And, you know, the standard norm in the industry is carriers get paid about thirty days after they do the work, which means they typically sell the liability to a factoring company and lose another three percent. So it's like two to three percent for the factoring, which means we're effectively giving them a two to three percent raise if they use the app. Yeah, it's a raise, and it also means they have more liquidity as well, right? Like, I don't know about you, but when I do work, I like to get paid, and if I have to wait thirty days to get paid, it's not a pleasant experience. I've got to buy groceries and pay rent. Exactly, and things like parts. How about with respect to the advent of self-driving cars and self-driving trucks? Is that something that you guys are actively thinking about at Convoy? Yeah, we are definitely actively thinking about that, and I know the founders have spent a lot of time, through their network, being very plugged in and on top of that. At the end of the day, when self-driving trucks show up, and it'll probably be in phases where there are different levels of automation, they're still going to need to connect with freight, and we have a platform that does that. And so our goal is to be able to just integrate that on the supply side of the platform. We've been talking all about the impact of data science on trucking with respect to the work you do at Convoy, and we've approached data science from a variety of different directions, and it's clear that a lot...

...of things play into this discipline, and you, for one, have quite a background: you're a computational economist, an ex-physicist and also previously a research scientist at Amazon. So my question is: how do all of these disciplines and histories play into your role and what you do as a data scientist? I think you can never have too many tools, and, you know, when I was a young physicist I was really lucky: I worked with John Wheeler, and besides channeling Niels Bohr, he used to say "never compute anything until you know the answer." It's a very famous Wheelerism, but what he means is you should have a sense for what's the right answer for your scientific problem. And there was a famous example with Feynman, where Feynman came in and thought he'd proved something, and Wheeler said "you're wrong," and Feynman was really annoyed, because how could Wheeler have done this problem? Feynman thought he had just done it for the first time. And there was an error in Feynman's calculation; Wheeler's knowledge of physics was so deep he just knew that it had to be wrong, it didn't make sense. And so, you know, things like: if you go out and run an experiment to measure lift on your direct mail campaign and you get ten percent, that is probably wrong. You just wouldn't expect that to be true. So I think physics is very helpful in that way, in terms of being scrappy and building up math chops, particularly linear algebra. I think linear algebra is super important for success in data science, perhaps more so than calculus. Economics gave me theoretical tools for thinking about business problems, as well as econometrics tools for confronting the theory with data, and my time in software engineering gave me the software skills to turn statistics into code. You know, everywhere you work, hopefully you're gaining new skills and you're learning, and I think that's super important for data scientists. I think, also, culturally, Amazon teaches a very adult way of thinking about problems: having a sense of urgency, being focused on impact. It's a lot...

...like being in a PhD program, where you learn to ask yourself the question regularly throughout the day: is what I'm working on going to get my PhD done? And if the answer is no, you're working on the wrong thing. Now it's time for a segment called Principles of Data Science Education. I'm here with Jonathan Cornelissen, co-founder and CEO of DataCamp. What's up, Jonathan? Hey, Hugo, thanks for having me. For sure, it's always a pleasure to chat. What are you going to tell us about today? Well, I'd like to chat about the educational and practical principles that underlie how we think about data science pedagogy here at DataCamp. You know, Carnegie Mellon has done a great deal of research on learning, and there's one quotation from their work that sticks out for me. What's that? "To develop mastery, students must acquire component skills, practice integrating them, and know when to apply what they've learned." That's a great quotation, and it clearly delineates three aspects of the educational experience. Exactly: one, you have learning skills and concepts; two, you have practicing them; and three, you have applying them. And this is precisely how we have approached building out products and content here at DataCamp. So DataCamp started out by building courses. Yeah, that's right, and our interactive courses support our mission to help people learn data science by doing. Over the last four years, we've grown our course library to over one hundred courses spread across R, Python, SQL, Git and Bash. So this covers the first aspect, learning. How about the second, practice? So we then developed a practice mode that allows DataCampers to practice the skills they've learned in a course by repeatedly taking short sessions consisting of, say, five to ten challenges. The secret to deliberate practice is consistency and repetition, and so we designed the system such that it is easy for everyone...

...to get started. On top of that, and this is really exciting, we recently launched our mobile app, which allows you to keep up practice on the go. The mobile app is so much fun. And then we come to the third principle, applying all the skills and knowledge. Right, and to do this we just launched a new interface called Projects, in which you'll be able to work on tasks data scientists encounter in their daily work. Tell me a bit more about these projects. So projects essentially allow learners to take the skills they've learned and apply them to an end-to-end analysis of a real-world task, using real-world tools and workflows, and then to showcase their work. In particular, the Projects interface combines a Jupyter notebook containing the narrative and the code of a project with a sidebar that gives you hints and instructions, and then, after completing the project, students can download the notebook and share it as part of their data science portfolio. So that's a whirlwind introduction to how we like to think about the educational principles of learn, practice and apply, and how we put them into practice. Thanks, Jonathan, for the chat. Thanks, Hugo, this was a lot of fun. Let's get back into our chat with Ben. I remember you gave a great talk at Data Science Pop-up Seattle called "Correctness in Data Science," and the reason I liked it is because you gave direction to what types of mistakes are made and what we can do as a field to correct those, in terms of building a well-defined discipline, which at the moment is, I suppose, a conglomerate of techniques, concepts and applications. So I'm wondering if you could speak to what you think are the major mistakes that data scientists are making today. Yeah, thanks, I'm really glad you liked that talk. I love it, and we'll put it in the show notes as well, so everyone who listens can watch it. Cool, I hope the listeners enjoy it. So, I think correctness...

...of scientific models is super important, and a lot of people, particularly when they're starting out, think, "oh, my code ran successfully and produced a number, it must be right." Well, back up: you want to make sure that's the right number, and there's an epistemological framework for thinking about that which came out of the nuclear industry, called verification, validation and uncertainty quantification. I'm indebted to Robert Rosner, who was my postdoc supervisor and, more importantly, former director of Argonne, for introducing me to VV&UQ. There are basically three parts to VV&UQ. The first V is verification: that's making sure your code correctly implements the model, whether or not the model itself is correct, and so that means you should do things like unit tests. You can also generate synthetic data through Monte Carlo methods with known parameters and make sure that you get the expected results, and do things like that to make sure your code is correct. Validation is making sure your model has fidelity to reality. So that's doing things like running experiments afterwards to make sure that your model is an accurate representation of reality. And uncertainty quantification is about thinking about the limits of your model. What assumptions have you made? Do they hold? Could something like a tsunami show up and take out your nuclear power plant? Maybe you should plan for that. So I think those are some basic things. You know, I love to ask engineers, when I interview them, how they know if their SQL is correct, and they usually look at me like this is a super crazy, weird question. But SQL is crucial, or whatever you're using to pull your data, because if you assemble a rubbish data set, nothing you do is going to fix it. Even if you do super fancy statistics, you're not going to be able to fix the fact that you didn't assemble your data correctly. So it's important to be very methodical and check, as you assemble the data, that it is correct. You should think about join plans, you should test it on subsets of the data and, you know, make sure aggregate statistics make sense and distributions are appropriate; check sensible things, like that you didn't get ten percent lift on your direct mail campaign.
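
Here is a minimal sketch of the verification step Ben describes: simulate data from a model with known parameters, run your estimation code on it, and check that you recover what you put in. The linear model and tolerances are arbitrary choices for illustration.

```python
# Verification via Monte Carlo: generate synthetic data with known
# parameters, run the estimation code, and check that it recovers the
# truth within tolerance. If it can't recover parameters it was handed,
# it certainly can't be trusted on real data.
import numpy as np

rng = np.random.default_rng(7)
true_intercept, true_slope = 3.0, -1.5

def fit_line(x, y):
    """The 'production' estimator under test: ordinary least squares."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

estimates = []
for _ in range(500):  # many synthetic data sets, not just one lucky draw
    x = rng.uniform(0, 10, 200)
    y = true_intercept + true_slope * x + rng.normal(0, 1, 200)
    estimates.append(fit_line(x, y))

mean_est = np.mean(estimates, axis=0)
assert np.allclose(mean_est, [true_intercept, true_slope], atol=0.05), mean_est
print(f"recovered intercept/slope: {mean_est.round(3)}")
```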

And then the other thing that's really important is models going into deployment. And so that means you may need integration tests or other tests to make sure that what's in production is faithful to what was developed in research. Yes. So, either where you've worked previously or at Convoy now, where does a data scientist sit in terms of putting what they work on into production? Yeah, that varies a lot by organization and group. So in some organizations or groups, a data scientist will just do pure research and then pass things over the wall, and engineering will do something magical. That can be problematic; things are often lost in translation. Many engineers are not happy if you give them R code; hopefully they're happy with Python. At Convoy, we're trying to work so that data scientists own their model end to end, and we have a machine learning platform that allows us to deploy models. We're not all the way there yet, we have some more work to do in that regard, but I think that's better for everyone, because then the engineers can just call against the data service to get whatever result they need. Also, the way we're organized at Convoy is, I think, very conducive to good data science, in that we are grouped into product groups. They all collaborate closely, but a product group will consist of a product manager, one or more data scientists and a bunch of engineers. And the thing that's great, too, about our product managers here is that they're all super technical people. Most of them have a master's in computer science or equivalent, or a good MBA, and they can all write SQL. There's a product manager here, and in the interview I asked him a question...

...a SQL question that involved using a left outer join, and he solved it in like sixty seconds. I think he's tired of this example, but that's the caliber of PMs we have here. They're technical, data-oriented, they can write SQL, they can understand undergrad-level stats, and that's very helpful, because that makes them advocates for doing data science correctly. And then we have a lot of extra social stuff to make sure that the data scientists continue to connect and collaborate horizontally. So, for instance, we run a data science brown bag. You know, we have technical one-on-ones where I meet with other data scientists and make sure that they're heading in the right direction and answer any technical questions they have, so they're not blocked. And I also think a technical product manager, as you say, has a lot of wins, but one of the major ones is that they can really have the conversation with you as well, right? Yeah, they're very invested in being data-driven. They know that if they write a plan for a new feature, they need to work with the data scientists to have a test plan, or some other plan to verify and validate their ideas. And they're all advocates for using data, and I think we have a super high bar for PMs, but it's crucial in an organization like this that they can participate in the data conversation, because often they're driving research questions. An example of how data-driven they are is Ziad Ismail, who's the chief product officer. He writes SQL; I mean, this guy stays up late at night writing SQL to understand the business, and has as deep a knowledge of the data in our data warehouse as anyone. That's cool. So you yourself have, I must say, a very impressive toolbox of statistical and econometric chops and data science techniques. I'm wondering: what's your favorite technique or methodology for data science? Yeah, that's a great question...

...and it's kind of like asking, you know, what's your favorite bird, or your favorite camera. The British expression, I guess, is "how long is a piece of string?" Nobody can answer that, right? So I tend to like using the best tool for the job. You know, I certainly came from a PhD program that was very strong in panel data methods; there are people at UCL, where I studied, like Richard Blundell and others, who did a lot to drive that forward. So that's a strength I have, using panel data methods, but I also like a lot of the core ML tools. It really depends on the problem. I want to use the best tool, and so what I'm happiest about is not using the tool but solving interesting problems. You know, I'm an applied scientist at the end of the day. I've worked on a range of problems, from quantum cosmology to bioinformatics to, you know, trucking, among other things. So having interesting problems is what matters, and being able to find the right tool to solve them is important, and that's in addition to data and software. So we've discussed a lot about modern data science. What does future data science look like to you? Yeah, I think we're in this amazing time: if you've got math and stats skills, there's so much data just exploding everywhere, and I think we're going to have a very fun and interesting time until Elon Musk figures out how to put us all out of business. What will happen until then? Yeah, so at some point there will be tools that are going to automate away a lot of the simple models. You know, I think you're starting to see companies trying to sell commodity churn models and things like that, which I think can be problematic, because ultimately every company's data is unique and you might as well build your own churn model from scratch. But you'll probably...

...see more commodity models and more tools that automate a lot of the lower-hanging data science fruit. And so I think, to have a successful and happy career, you want to move further up the value chain, to things that can't be replaced by automation like automated feature engineering. For instance, at a previous company I worked at, Context Relevant, they made good progress on automating feature engineering for a large class of problems. And what types of skills would you suggest aspiring and even seasoned data scientists develop in order to not have their jobs automated? So the first thing is: you can never know too much math, and I think that's something that's really worth investing in, and that starts at a young age. You know, when I taught at Galvanize, where they run a data science boot camp, I could remediate a lack of programming with someone in eight weeks. I can't remediate a lack of math in eight weeks, or twelve weeks; that's years of study. So I think you should continually invest in math. After that, you want to master the core algorithms, and then you need to keep reading. It's really important to keep reading. A lot of people stop reading when they get into industry, and, particularly for the more advanced people, you want to choose some specialization that plays to your interests. So experimentation is something that's particularly interesting to me, as are Bayesian methods, and these are both things that I've worked on going much more deeply into in recent years. I mean, I know a lot of people are going into deep learning, and that's a very competitive space. I think there are a lot of other interesting and important areas in data science, so, yeah, sure, if you want to go into deep learning, go into deep learning, but I think there's benefit in being a bit contrarian. I think so as well. So, with all that having been said, do you have a final call to action for budding and established data scientists alike? I think the main thing I would say to someone who's interested in getting into data science is: understand that you're setting yourself up for a life of...

...learning, and that it's a marathon, not a sprint. You know, it's like getting a PhD, and you need to pace yourself. This is something that could take you multiple years to pull off, and so you need to keep investing. So maybe you should watch a little less Netflix at night and spend a little more time reading the relevant books and papers, writing code, playing with models, and, you know, if you don't have that excitement about data, there may be some other place where you're happier. So keep learning, keep reading, keep doing. Yeah, and for me, really, and I think for a lot of us in the profession, data is like an Agatha Christie novel. There's like this mystery in there, and I want to unlock it and solve it and figure out, you know, if it was Colonel Mustard in the living room with the candlestick. That's fantastic, so you're living your data science life as detective fiction. Yeah, something like that. That's incredible. I can relate to that a lot, because the data just keeps giving. I've got a colleague who always tells us: you need to listen to your data; it'll speak as long as you're listening. Right, right. And, you know, you asked earlier about the mistakes people make, and one mistake I've seen a lot of data scientists make is that they leap into modeling too soon, without doing EDA. EDA is the famous Tukey term for exploratory data analysis, and it's really worth investing some time in EDA, because you will discover surprising things. When I joined Convoy, I started doing some EDA on the data and I found some data cleaning and outlier problems that had not been addressed, and we fixed those and got like a ten percent improvement in the pricing model, and that was like free performance. That's very telling, I mean, that doing something like EDA, exploratory data analysis, matters, because it is tempting to just jump in and try to build models straight away, but I always encourage people to try to visualize the data in a hundred different ways and look at the summary statistics and all of that type of stuff before doing anything else.
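
A minimal flavor of that pre-modeling EDA pass in pandas; the file name and column names below are hypothetical, just to show the shape of the workflow.

```python
# A first EDA pass before any modeling: shapes, types, summary statistics,
# missingness and a simple outlier screen. File and columns are hypothetical.
import pandas as pd

df = pd.read_csv("shipments.csv")   # hypothetical data set

print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))   # summary statistics, including categoricals
print(df.isna().mean().sort_values(ascending=False))  # share missing per column

# Simple IQR-based outlier screen on a numeric column, e.g. price per mile:
col = df["price_per_mile"]
iqr = col.quantile(0.75) - col.quantile(0.25)
outliers = df[(col < col.quantile(0.25) - 1.5 * iqr) |
              (col > col.quantile(0.75) + 1.5 * iqr)]
print(f"{len(outliers)} rows flagged as price outliers")

# And visualize in many ways; histograms are a cheap start (needs matplotlib):
df.hist(figsize=(12, 8))
```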

Right. Now, I try to teach students a very methodical, standardized approach, and that's one of the things I think is great about CRISP-DM, which is something we discussed in prep. So CRISP-DM is the cross-industry standard process for data mining, and it's probably the best workflow I've seen for a data science project, and it's really good to go through all the steps to make sure you don't leave anything out. So you start out with understanding the business problem, then understand what data you have, prepare your data, model, evaluate and deploy, and at any point you may find some mistake that you need to go back and address in one of the earlier steps. So you start modeling and you realize, "oh, I didn't do my feature engineering right, I need to add a feature to capture some key behavior where the model is failing." It's very good to be systematic like that, so you don't duplicate effort or miss steps. And ultimately, in terms of correctness, which we've also touched on, I'd like to find a way to marry CRISP-DM and VV&UQ, and I think then you're in a very powerful, professional and mature setup. Fantastic, and that really speaks to a systematic structure for what future data science could look like, or incorporate. Right. And then the other part of that is, you know, the modeling is the fun part and getting to the answer, but the stuff that comes before it is super important, and it takes about eighty percent of your time, that cleaning of data and preparing it. And then the modeling is kind of fast; it's like this high, and it's like, boom, there it goes, and I'm back to normal. And I think that we will see more tools, hopefully, that will make that pre-modeling period faster, because that's where you're going to get your big productivity gains, if you can become faster in that phase, and so that's a good place to invest. So you should learn things like Unix and get really good at using all the command-line tools and other technologies or platforms that are going to help you get your data ready to model...

...more quickly. Exactly. Ben, it's been an absolute pleasure having you on the show. Thanks, Hugo. It was a treat to be here, and I wish you the best of luck with the show. Thank you. Thanks for joining our conversation with Ben Skrainka about how data science is being used to revolutionize the trucking industry. We found out how the challenges facing an industry that predates the tech industry, such as trucking, can be overcome using modern tech and data science techniques. We need to keep in mind that this will have a huge impact on the ground, as being a truck driver is the most common job in a lot of states in the US. We also discussed the importance of running experiments at Convoy and the need for more consistent and rigorous data science practices. Make sure to tune in for the next episode of DataFramed, where I'll be speaking with Maëlle Salmon, a data scientist who has worked in public health, both in infectious disease and environmental epidemiology. We'll be talking about the role of data science in epidemiology, the impact of open source software development in data science, and diversity.
