DataFramed

Episode · 5 months ago

#99 Post-Deployment Data Science

ABOUT THIS EPISODE

Many machine learning practitioners dedicate most of their attention to creating and deploying models that solve business problems. However, what happens post-deployment? And how should data teams go about monitoring models in production?

Hakim Elakhrass is the Co-Founder and CEO of NannyML, an open-source Python library that allows users to estimate post-deployment model performance, detect data drift, and link data drift alerts back to model performance changes. Originally, Hakim started a machine learning consultancy with his NannyML co-founders, and the need for monitoring quickly arose, leading to the development of NannyML.

Hakim joins the show to discuss post-deployment data science, the real-world use cases for tools like NannyML, the potentially catastrophic effects of unmonitored models in production, the most important skills for modern data scientists to cultivate, and more.

You're listening to DataFramed, a podcast by DataCamp. In this show you'll hear all the latest trends and insights in data science. Whether you're just getting started in your data career or you're a data leader looking to scale data-driven decisions in your organization, join us for in-depth discussions with data and analytics leaders at the forefront of the data revolution. Let's dive right in. Hello everyone, this is Adel, data science educator and evangelist at DataCamp. Last week on the show we had Serg Masís on the podcast to talk about interpretable machine learning. Throughout the episode we talked about the risks that may plague machine learning models in production, and as these risks grow, what are the tools at the disposal of practitioners to evaluate and understand model performance post-deployment? This is why I'm so excited to have Hakim on today's episode. Hakim is the co-founder and CEO of NannyML, an open-source Python library that allows data scientists to estimate post-deployment model performance, detect data drift and link data drift alerts back to model performance changes. Throughout the episode we spoke about the challenges in post-deployment data science, how models can fail in production, some cautionary tales to avoid, why NannyML is open source, the future of AI and much, much more. If you enjoyed this episode, make sure to rate and comment, but only if you enjoyed it. Now on to today's episode. Hakim, it's great to have you on the show. Yeah, thanks Adel for having me, I really appreciate it. I'm excited to speak with you about post-deployment data science, your work leading NannyML and so much more. But before that, can you give us a bit of background about yourself and what got you here? Yeah, sure, so I'm originally American, born and raised in New Jersey. My educational background is in biology, so actually quite far from data science. Originally I was mostly working on evolutionary biology and population genetics, and in the last year of my bachelor's I did...

...a course in bioinformatics, and that was in R, using a bunch of machine learning techniques for population genetics purposes and genomics in general, and stuff like that, and that really hooked me on the concept of what you can do using programming and machine learning. Originally I actually wanted to be a doctor, and my idea was personalized medicine. That was what I was super passionate about: like, oh man, using machine learning and genetics you can get to a point where you can really, on a personal level, prescribe treatments and really help people at scale, but personalized, and that was something I was super passionate about. And so then I was like, okay, before I go to medical school, I want to go do a master's in bioinformatics, and I ended up moving to Belgium to do the master's at KU Leuven, which is funny because I know DataCamp is headquartered in Leuven. It's a nice city. And basically my motivation behind that was pretty funny: I wanted the highest-ranked, cheapest master's program that exists for bioinformatics, and you end up with KU Leuven, which is ranked something like twenty-nine in the world and was about six hundred a year. But then I really got hooked on the data science and machine learning side, I abandoned my dreams of becoming a doctor and decided to just go full force into machine learning. And then I worked both as a data engineer and data scientist, and I started a machine learning consultancy with my co-founders from NannyML. Eventually we just saw that every time we put models into production, we always got this question about, okay, what happens next? How can we trust the models? How do we know that they're performing well? And at the time there were a lot of smart teams working on what we call the MLOps part of it, so the infrastructure behind it: how do we actually deploy a model, the serving and things like that. And so we decided, okay, given our expertise, which is data science and the algorithms, not really the actual programming and being good software engineers, that's how we decided to work on NannyML, and we also obviously thought it was super important. So I am very excited to discuss with you your work leading NannyML, but before that, I'd love to anchor today's discussion in some...

...of the problems that you're trying to solve. So, you know, over the past year we've had quite a few folks on the podcast discuss the importance of MLOps and the different challenges associated with deploying machine learning models at scale. However, another key aspect of this is post-deployment data science work, as in monitoring, evaluating and testing machine learning models in production. I would also add improving and understanding business impact here. I'd love to start off, in your own words: walk us through the main challenges in post-deployment data science. Yeah, sure, and maybe just a little side note on why we like using the term post-deployment data science instead of just machine learning monitoring. I don't want to come off as the random people who are just trying to create something new, but the way we see it is that monitoring is a very passive activity, and if you describe NannyML, what it does today is indeed a monitoring library. But we see the work as so much more than just monitoring. Yes, the first step of post-deployment data science is actually monitoring and knowing how your models are performing, but then we see that there's a whole host of other things that you have to do. That's why we like the term post-deployment data science, because we feel it is actually real data science and it will be the responsibility of someone with more data science skills. But maybe to go into what the challenges of post-deployment data science are: I think the first and foremost is knowing the model performance in the first place. It's not so trivial, whether you have ground truth or not. So whether you know, after the prediction, what actually happens in the real world, which is ground truth, or you don't have it, knowing your model performance is still pretty challenging from an engineering perspective and from a data science perspective, and so just having that kind of visibility is already pretty hard. And maybe to give an example: when you have ground truth, it's more of an engineering problem, so when can we get our ground truth and compare it to what our model actually predicted? But when you don't have ground truth, it's much more of a data science and algorithmic problem. So, for instance, credit scoring, where you have a machine learning model that decides if someone should get a loan or not, and the model predicts yes or no, this person should get a loan. And then when do you know whether that prediction was correct or not correct?...

Either the person pays back the loan or they don't pay back the loan, but either way it's very far in the future. So you cannot calculate the performance of your model in the traditional sense. And then the second challenge, I would say, is that models fail silently. That's, you know, one of our taglines. Basically the problem with machine learning models is that if you give a model data in the right format, it will make a prediction; whether that prediction is right or not, that's not the model's problem. And so, yeah, software for the most part fails loudly: you have a bug, an error, it doesn't run, so you know it's not working. But with machine learning models you actually don't know when it's not working, and so that silent failure is really problematic. And then a third challenge, I would say, is that most data drift is virtual. We didn't get into the nitty-gritty technical details yet, but basically data drift is when you have a change in the distribution of the input variables to your models, and most of it is virtual. What does that mean? It means it doesn't actually impact the performance of your models. You have data drifting all the time, but your model can actually handle it. So if you were just detecting when the data changes, you're going to get a lot of unactionable noise and you won't actually be able to do anything with that information. And then, I would say finally, which is the most complicated one and probably one a lot of data scientists still don't have that much experience with, is the feedback loops. So the relationship between the technical metrics of your models and your business metrics might change, but in general you just have a lot of machine learning use cases where you're making predictions on a customer base, for instance, and you have a model that makes a prediction on a customer and then a different department does something. Like, imagine a churn model that decides who will cancel a subscription or not, and you predict someone will cancel their subscription, and then the retention department sends them a discount, and then the next month, or whenever your model is running again, you make a prediction on that same customer again. So then the model is actually impacting the business and the business is impacting the model, and you have these very interesting feedback loops where the model performance will definitely change. And then you can have things where, like, when you were...

...building your model, you ran a bunch of experiments, and you have a business metric of keeping churn below five percent, and to achieve that you needed a ROC AUC of, for instance, 0.7. But then over time the model performance actually has to go up, maybe to a higher ROC AUC, to achieve the same business results, because maybe in the beginning you were able to easily detect the people who will churn and you took the first steps to stop them from churning, and then the people who churn later become harder to detect, or other weird things can start to happen. So there are definitely lots of challenges once models are put into production. I love the holistic list, and I definitely agree with you that this is a problem that is foundational to the industry if we're going to be able to really derive value from data science at scale. You mentioned the "models fail silently" component, right? Can you walk me through the different ways machine learning models fail silently and why? Yeah, sure. So I already mentioned that most models fail silently. I would say there are two main ways: data drift induced failure and concept drift induced failure. Basically, data drift induced failure is when the input data to the model has changed to the point where the model has not seen enough data in the new distribution to make good predictions. So, for example, the average age of your customers used to be much lower and now it's fifty. So the individual feature of age has shifted in distribution, and maybe your model hasn't seen enough fifty-year-old customers to be able to make good decisions there. But again, it will just keep predicting on them as if nothing happened. And so that's the silent part. The second one is concept drift induced failure, and this is a change in the relationship between the input variables and the output. A machine learning model is basically just trying to find a function that maps inputs to outputs; you're trying to approximate, as well as possible, the real mapping function that exists somewhere in, you know, the ethereal space of reality. So concept drift can be caused by the actual behavior of the underlying system changing, most often through a variable that's not included in your model. So, for instance, your actual customer behavior has changed: maybe now your twenty-five-year-old customers are buying cheaper products because the economic conditions have changed, and you don't capture economic conditions in your model. And sometimes concept drift induces data drift, so, for instance, in this case you would see the average price of the products your customers are buying decrease over time. But sometimes concept drift can also be silent, where it actually doesn't impact any of the data in your model, but something did change in the real world that you're not capturing, and so the fundamental behavior of the system is different, and then the performance can suffer from that. And both of these can lead to either catastrophic failure or gradual degradation of performance.
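
To make that distinction concrete, here is a minimal synthetic sketch (purely illustrative, not NannyML code; the features, weights and numbers are all made up): a logistic model is trained on toy churn data, then scored on a batch where only the age distribution has shifted, and on a batch where the effect of income on churn has flipped sign, so the mapping itself changed while the inputs look unchanged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_customers(n, age_mean=30.0, income_weight=1.0):
    """Toy churn data: the true churn logit combines standardized age and income."""
    age = rng.normal(age_mean, 8.0, size=n)
    income = rng.normal(50.0, 15.0, size=n)
    logit = (age - 35.0) / 8.0 + income_weight * (income - 50.0) / 15.0
    churned = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return np.column_stack([age, income]), churned

# Train on historical data where churn rises with both age and income.
X_train, y_train = make_customers(20_000)
model = LogisticRegression().fit(X_train, y_train)

# Data drift: customers are now much older, but the age/income -> churn mapping is
# unchanged. The inputs moved, yet the model can still rank customers about as well.
X_data_drift, y_data_drift = make_customers(20_000, age_mean=50.0)

# Concept drift: the inputs look exactly like training, but the effect of income has
# flipped sign -- the mapping changed, with no warning in the feature distributions.
X_concept_drift, y_concept_drift = make_customers(20_000, income_weight=-1.0)

for name, X, y in [("training", X_train, y_train),
                   ("data drift", X_data_drift, y_data_drift),
                   ("concept drift", X_concept_drift, y_concept_drift)]:
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    print(f"{name:>13s}  ROC AUC ~ {auc:.2f}")  # concept drift collapses toward 0.5
```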

I love that distinction you make between concept drift and data drift. And harping on catastrophic failure and gradual degradation of performance: a lot of the problems that you discuss here have consequences that can range from harmful, to say the least, to, as you said, catastrophic for an organization using machine learning and AI. Can you walk us through this range of consequences for badly monitored machine learning and AI systems? I mean, I guess the first thing can be nothing, depending on how impactful your use case is. That's the funny thing in this space: your model is only as valuable as the underlying business problem at the end of the day, right? So if it's a model handling some fringe cases, or something in your company that doesn't generate a lot of value, then if the performance changes, maybe nobody will care. Or maybe, depending on the processes in your company, the model doesn't actually do anything by itself. Maybe it outputs data frames and then they're imported into an Excel sheet and shared with business, and then business looks at the results, and there's actually no automated process in there. So it could be that it's less important to monitor, and I would say that's the lowest level. Then you have gradual degradation. So yeah, maybe just one thing: monitoring becomes essential, like really essential, when your models are in mostly automated systems. It's important before that too, because you want to know whether the business decisions you take on top of the model are okay at all, but if it's in a churn system where a model predicts someone...

...is churning and then an automatic email campaign goes out to give them a discount or whatever, however your company wants to handle it, that's when monitoring is really, really important. And so gradual degradation is just: over time the data is drifting a bit and your model becomes less and less performant. That can be a little bit like the feedback loop thing I mentioned before, where over time you're already identifying the people who are most likely to churn and you're stopping them from churning, and so the people left over just become harder to detect, and your model gets worse and worse. But it's not anything catastrophic; maybe it causes a ten percent loss in performance. And again, this all depends on the underlying use case. In some use cases a ten percent loss might be, like, whatever; in other use cases that loss might be huge. It really depends on whatever business use case you're working on. And then you have catastrophic failure, which is a lot worse. We've seen a few of those in the data science space. I always point to Zillow, where they basically systematically overpriced around seven thousand houses by roughly three hundred million dollars, and then it just collapsed and their market cap actually dropped by thirty billion dollars. So that's, yeah. And they fired, well, they shut down the entire division that was buying and selling houses and let everybody in it go. So that's catastrophic failure. And maybe a non-financial catastrophic failure could be the chatbot Tay from Microsoft. I don't know if you remember that, the one that was spewing out a lot of harmful content, that was trained on Reddit content. Yeah, yeah, it got racist and terrible real fast. So those are the catastrophic failures, where maybe you're introducing some systematic risk that you don't realize, and then all at once it just collapses and lots of bad things happen. And then finally, I would say, there's discrimination and bias, and the main impact from that is, yeah, it's not moral, it can be pretty bad PR, it's just not...

...good. And it can be that when you built your model you didn't see bias against certain demographics because you didn't have enough of them in your data, and then maybe over time more and more of a certain demographic enters your data and the model can't make good decisions on them. And that would obviously be bad, also from a financial perspective, because if you can't make good decisions on a certain segment, you're obviously not doing the best for the company. But it's also just not fair to the people you're making predictions about, because whatever it is that's impacting them would be discriminatory and not doing the best it can do. So I would say those are the impacts of what can happen. That's really great. And do you mind expanding on that Zillow case study for a bit? Because I think this is a very interesting case study for data scientists that are deploying machine learning in the wild, especially once data science becomes foundational to a company's business model. So do you want to expand on how that failure happened, as well as the underlying issues that led to that catastrophic failure? So that's a good question. We can only speculate, because we actually don't know, and also some people claim it's not a data science problem. But you can postulate how it can happen. The thing that's interesting with, for instance, house price prediction, which is essentially what they were doing, if I understand correctly, is that they had a machine learning model that decided the price at which they should buy a house, so you essentially predict the price of that house. The problem with that is you don't have any ground truth, because the price that the model predicts is the price you buy the house at. The prediction becomes reality. So you don't know the, quote unquote, performance of the model in the real world. Probably when they were building the model, they had a bunch of house prices and they tried to predict them, and they measured the performance like you would in whatever kind of data science system. But once you put it out in the real world, there is no real price, because the real price is whatever the model makes the price. And so you can see how, if you cannot really calculate that performance and you're introducing these little systematic errors over time that push the house price higher and higher for whatever reason, eventually you can realize that you have a huge portfolio of very overpriced houses, and that makes a huge problem for you.

That's incredible and very interesting, especially when you mention how the prediction becomes ground truth. There's this feedback loop, as you mentioned, where the machine learning model becomes reality, and companies can't escape that to a certain extent without really proper modeling. In contrast with software engineering, it strikes me as an interesting aspect of data science that post-deployment work, and MLOps in general, has still to this day not been codified and not yet matured around a set of best practices, tools and rituals in data science. Why do you think this is the case? I would say the main thing is that it's still early days. If you look at data science as a field, of course, I don't want to get any statisticians mad, because if I say data science is a young field I will get a horde of very angry statisticians who will be like, yeah, I was doing machine learning in the seventies at a bank somewhere. Yes, we know. But data science as a field in non-financial industries, let's just say, is relatively new, and so I think there's a big learning curve. And it also comes with the concept of risk, by the way, because you have these financial institutions who have been doing this modeling for the past seventy years, maybe even longer, and they developed all of these processes around handling the risk that comes with taking decisions based on mathematical modeling. They have entire risk departments, validation, they have this whole process. The government even regulates it, right? If you want to build a model for credit scoring, the government has benchmark models that they can compare your results to. It is extremely regulated and well known how to handle that. And I think the problem is when you have, for instance, a grocery store or a media company that doesn't have a risk department, and they don't have any kind of inherent understanding of risk, starting to take decisions based on models, and it can really impact their company strategy, right? Because if you think of a churn model, if you see a lot of customers are churning, you might say, oh man, we have to change how we're doing our marketing. You can really take big decisions based on what's coming out...

...of your machine learning systems. But I think it mostly comes down to it being the early days and people not having, in general, a lot of experience with these kinds of things. And, to be honest, as much as data scientists love to feel that data science is already everywhere, there are not that many models in production. Unfortunately, it's still very early. You know, what I like about being a monitoring company, or NannyML in general, is that it's a very good litmus test for whether a company actually has models in production, because sometimes it's like, yeah, we have models in production, we're doing data science and it's so cool, so you can use NannyML or another monitoring library, and then it's, oh no, we're not ready for that yet, the models are not actually in production. It's quite funny. I think the latest survey I found said something like ten percent of models make it to production at the companies involved, so I definitely agree with you. Yeah, so it's just the wild west, and I think it's normal that you see, in the early days, this kind of whole mess of different practices and hundreds of tools, and that over time, as it matures, I think it will become much clearer what the best practices are and how things should be handled. So I think that's a great segue to discuss how NannyML aims to solve a lot of these challenges. Can you walk us through what NannyML is and how it works? Yeah, sure. So NannyML, as of today, is an open-source Python library, so data scientists can just pip install it and basically use it to detect silent model failure. Right now, the way we see most data scientists using NannyML is in a notebook: they run an analysis on their models in production and the data they have. But you can also then, of course, deploy it. It can be dockerized, it can be, you know, deployed however you want, and have it monitoring in the traditional sense, in near real time or in batches, whatever you want to call it. But NannyML in general has three main components: performance estimation and monitoring, data drift detection, and intelligent alerting. And so basically, with performance estimation, that's the whole reason we're doing NannyML. We spend a...

...lot of time doing research to try and find some sort of methodology that would allow you to estimate the performance of your models in the absence of ground truth, so that instead of having to wait a year for your credit scores to know if someone defaulted or paid back the loan, you can in the meantime actually estimate the performance that your model would have on the current data in production. That was quite hard and took lots of research, but it works pretty well now. I'm probably not the best person at NannyML to go into the gory details of this; I'm a data scientist, we've got a research guy for that, but I will try as best as possible to explain it, and then my data scientists will all yell at me and say that I'm stupid. Just kidding. But basically my understanding of it is that, at least for classification models, your model outputs two things: a prediction and, basically, a confidence score. So you know if you have class zero or one and how confident your model is about that score. Imagine you have a prediction of one and a confidence score of, say, 0.9. You can kind of say that the model is correct for ninety percent of the observations where it predicts one, and you can use some magic from there to basically reconstruct an estimated confusion matrix, and it seems that this captures all changes in performance that are due to data drift. From the estimated confusion matrix you can just get an estimated ROC AUC, or an estimated F1, or any machine learning metric, estimated without the ground truth. And it seems that if there is a change in performance and it's due to data drift, we know that the performance has changed and we know by how much it has changed, because you get the actual performance metric, and as far as our experiments can tell, it can do that quite well and quite consistently across all use cases and all kinds of datasets. The hard part comes with concept drift, of course; we cannot capture changes in performance due to concept drift quite yet, but it's something we're researching hard. Also, that explanation was probably not extremely coherent, so please go read our docs and you'll get a much better explanation than the one I just gave of our performance estimation.
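
To give a rough feel for the estimated-confusion-matrix idea described here, below is a bare-bones sketch in plain NumPy. It is not NannyML's implementation (the library's estimator also calibrates the scores against a reference period and handles chunking, multiple metrics and more); it simply assumes the model's scores are already well calibrated, and all names and numbers are illustrative.

```python
import numpy as np

def estimated_confusion_matrix(scores, threshold=0.5):
    """Build an expected confusion matrix from calibrated scores alone (no labels).

    Assumes the scores are well calibrated: a score of 0.9 means the positive
    class really occurs about 90% of the time for such observations.
    """
    scores = np.asarray(scores, dtype=float)
    predicted_positive = scores >= threshold

    tp = scores[predicted_positive].sum()           # predicted 1, expected truly 1
    fp = (1.0 - scores[predicted_positive]).sum()   # predicted 1, expected truly 0
    fn = scores[~predicted_positive].sum()          # predicted 0, expected truly 1
    tn = (1.0 - scores[~predicted_positive]).sum()  # predicted 0, expected truly 0
    return tp, fp, fn, tn

def estimated_metrics(scores, threshold=0.5):
    """Turn the expected confusion matrix into estimated performance metrics."""
    tp, fp, fn, tn = estimated_confusion_matrix(scores, threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Stand-in for a production batch of model scores where ground truth is not yet known.
rng = np.random.default_rng(0)
production_scores = rng.beta(2, 5, size=10_000)
print(estimated_metrics(production_scores))
```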

Yeah, and then there's the data drift part. That's more run of the mill. There are two parts to it. We have univariate drift detection, which is basically when the distribution of an individual variable changes, like age. And then you have multivariate drift detection, where NannyML gets a bit fancy again: we developed our own algorithm for detecting multivariate drift, so that's basically detecting drift in the dataset as a whole and in the relationships between variables. And there it does something with a PCA reconstruction error, where you basically fit a PCA on a reference period, then reconstruct the data in an analysis period with it and compare the reconstruction errors, and that error will tell you how different the data in your analysis is from your reference. So then we basically also alert you when changes in performance happen and try to point you in the direction of the data drift that has potentially caused those changes, to make it a bit more actionable. Although, it's not "caused", I can't say caused, we don't do causal machine learning just yet, but basically data drift that happened at the same time that your model performance changed. So, more correlated. Yeah, and that's it.
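
As a rough illustration of that reconstruction-error idea on synthetic data (again, a hand-rolled sketch rather than NannyML's own multivariate drift calculator, and every number here is invented): fit PCA on a reference period, reconstruct both periods through it, and compare the average reconstruction error; a jump on the analysis period signals that the structure of the data has changed even if individual columns look unremarkable.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fit_reconstruction_model(reference, n_components=2):
    """Fit a scaler and PCA on the reference period (data the model is known to handle)."""
    scaler = StandardScaler().fit(reference)
    pca = PCA(n_components=n_components).fit(scaler.transform(reference))
    return scaler, pca

def reconstruction_error(X, scaler, pca):
    """Average distance between each (scaled) row and its PCA reconstruction."""
    X_scaled = scaler.transform(X)
    X_restored = pca.inverse_transform(pca.transform(X_scaled))
    return np.linalg.norm(X_scaled - X_restored, axis=1).mean()

rng = np.random.default_rng(42)

# Reference period: three features where the first two are strongly correlated.
base = rng.normal(size=(5_000, 1))
reference = np.hstack([base,
                       2.0 * base + rng.normal(scale=0.1, size=(5_000, 1)),
                       rng.normal(size=(5_000, 1))])

# Analysis period: three independent features -- the relationship between columns broke down.
analysis = rng.normal(size=(5_000, 3))

scaler, pca = fit_reconstruction_model(reference)
print("reference error:", reconstruction_error(reference, scaler, pca))  # close to zero
print("analysis error: ", reconstruction_error(analysis, scaler, pca))   # noticeably higher
```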

Are there any examples of NannyML being used in production today? So, since we went open source, we've seen, and this was our hunch all along, that the performance estimation would be particularly useful in credit scoring, and so we've identified a bunch of users in the financial industry using NannyML for credit scoring, and when people come to ask us about NannyML it's often from that financial industry and credit scoring angle. But that kind of goes back to the importance of the underlying use case: obviously you don't want to give loans to people who shouldn't be getting loans, and the financial incentive there is very high. So if we can help reduce the error there, obviously they're going to be very happy to use NannyML. And one of the sad things about being open source is that it's relatively hard to know who's using your software and what they're using it for. We try to identify people using it, we try to talk to who you would define as our ideal user and see who's using it, who's not using it, how they're using it, but in general it's pretty hard to identify them. And also, because, well, everyone should be doing this, but in the early days it's super important to work closely with users. So we have a series of design partners where we basically work together with them to deploy NannyML, iterate on the library and things like that, and there it's really varied, from churn prediction to demand forecasting and things of that nature. So it's really all over the place there. Yeah, so one thing that I found interesting about NannyML is that you decided to make it open source. Can you walk me through the decision making here, why make NannyML open source, and what are the pros and cons of going open source as an up-and-coming machine learning package? Yeah, so, like I said, we were doing a lot of research behind our algorithms and we spent quite a long time working with design partners, so we had real-world data and things of that nature to run experiments on and make sure that we could build algorithms that work well. But the kind of feedback we were getting from our users was like, okay, we're data scientists, and if we're using a novel algorithm we need to know how it works. We can't just trust you; we're not just going to put data into this system and get back some results without knowing exactly how it works, and just trust that our performance is all fine and everything is fine. So that was the kind of consistent feedback we were getting, and from there it was like, okay, our users want to know how this works, and then we were like, okay, if they know how it works, then it might as well be open source. But in general, also for widespread adoption: before we were open source, we were kind of a no-name Belgian startup, having our design partners, and getting feedback was very slow, basically because we're working with these big enterprises, they don't have that much time, iteration is hard. And then we were like, this is not nice, let's go open source. And that's,

I think, what makes the most sense for a lot of data science products. So basically we just wanted to reduce friction, allowing as many people as possible to use it and give feedback, and ultimately to build a better product for the data science community. And I would say the main con of being open source is essentially that, as a startup, you have to find product-market fit twice. So for your open source solution, you have to get mass adoption, and then, after you've done that, or when you're well on your way to that, you have to build a paid offering so you can still exist as a startup, and then you basically have to find product-market fit for the paid offering as well. So that's definitely a big challenge. I couldn't agree more, and I love the fact that it's open source and that you're leveraging open source to be able to accelerate the feedback loop. I think the crux of the challenges that NannyML is attempting to solve is kind of a tension between what data scientists are trained to do versus what is expected of them in the real world. There's a lot of data work around pre- and post-deployment that is increasingly crossing over into the engineering realm. To start off, I'd love to know how you feel a modern data team should be organized. Do you believe in a one-size-fits-all data scientist that can do both the data science and the deployment work, or do you think these capabilities should be splintered off within the data team? That's a very good question. I think, as with most things in data science, it depends. It depends on the size of the team, it depends what they're working on, it depends how advanced the team already is. I think in bigger companies, when you have dozens of use cases, specialization becomes necessary, and you already see that: you have data engineers, data scientists, ML engineers, then there are data analysts, BI, there's this whole slew of roles for the data team, and I think that makes a lot of sense. I think as companies become more advanced and more and more models go into production, you're actually going to have a split between pre- and post-deployment data scientists. Right now you see a lot that monitoring falls under the role of the ML engineer, but I think...

...that ML engineers are going to move towards mostly the infrastructure and the ops work, and you're going to have these post-deployment data scientists who will take over the models in production from a data science perspective. So again, it's: how are the models performing? Are they providing business impact? What are the feedback loops? Really doing analysis and also working to make the models better and to increase their business impact, and that's a whole set of skills on its own. So you can imagine, if you have, I don't know, ten-plus use cases in production, you can have an individual who's just in charge of all of those use cases once they're in production. At least that's how I think it will go. It's the future, so you never know; I could be absolutely, totally wrong about this, but yeah, I think that's what's going to happen. What do you think are the best skills for modern data scientists to have today? So, putting models into production. I think that will really set you apart, if you have any kind of experience with really building a model and actually putting it out into the real world and seeing what happens. I just think it's one of those important skills that not that many data scientists have. And also really having this feel for the business impact of a model, right? A model is not just its performance metrics, its technical metrics. It's also: why are we building this model? What value does this model add? How is this value changing over time? How is it impacting other departments and the people around me? And it's, well, it is a technical skill on one hand, but on the other hand it's also just deep intuition about why you're doing what you're doing. I find that super important. Yeah, and just being able to detect these changes in performance, understanding these new concepts like data drift detection and concept drift, really knowing them well, and being able to use the tools that allow you to do that. Of course, I'm biased in all of this, because NannyML is what I'm working on, but I genuinely think that as well. Now, as we're closing out, Hakim, what are the top trends that you're looking at within this space that you're particularly excited about, both in post-deployment data science but also in data science in general? In post-deployment data science, I don't know, it's hard to say, it's really, really early days. Maybe more MLOps-type things. Like, I'm...

...really happy to see all of these frameworks coming out that make the engineering side of putting models into production much easier. Shout out to our friends at ZenML, that's an up-and-coming framework for MLOps, and they're really great. I see a lot of the work they're doing, and people actually ask us if we integrate with them and things like that. So I really like these kinds of tools, and they're also open source, so plus one for that. And that makes me really excited, that people want to get models into production and that there are people working on these problems. So that's really cool. I think in data science in general, I really like anything that's generative. So like DALL-E, which came out recently, and you see all these insane images, or GPT-3 in general, just the text that it can produce and things like that. That's, I don't know, super exciting. Although I do not buy all the hype about AI. There was this Google engineer recently who said that the Google AI is sentient, and it was just like, no, it's not. No, it's not. I don't buy it either. Yeah, I don't buy that, but I do find it cool. The AI community is rarely united over things, but this one was very much so; the AI community is super united on this not being sentient. Yeah, yeah, indeed, indeed. It's like, dude, it's just a bunch of functions put together. Please. You trained it on an insane amount of data, of course it's going to tell you that it doesn't want to die or whatever. It's kind of funny, because it's not weird that AI or machine learning models trained on text or language written by humans would then exhibit the same behavior as humans when you ask them those questions. It's very interesting. But yeah, all that generative stuff, I find it super fun and super cool and also very useful. Likewise, I'm super excited about what's ahead as well with GPT-3 and DALL-E. Now, Hakim, before we wrap up, do you have any final call to action for the audience? Yeah, I think the first one is a bit cliche, but data science work is very impactful and I think a lot of people...

...don't realize how much it can impact people and society in general. It's actually a pretty powerful tool, and so, as cliche as it is: don't be evil. Be conscious of your power as a data scientist and of what your models are doing and how they might impact people and society, and really be conscious of it, because I think it's sometimes taken very lightly. It's very funny, I was having a conversation recently with a software engineer, and he asked, yeah, how come there's not a machine learning system that just analyzes all the cars on the road and detects if someone is a drunk driver or not? I was just laughing, and I'm like, I don't think that's allowed. And then he was just like, yeah, as a software engineer I never have to think about whether my work is allowed or not, I just build. And I'm like, yeah, indeed. That's why it's important to have ethics courses in different programs. Yeah, indeed. So I think that's it: just be conscious of it, don't be evil. And then, of course, if you have models in production and you're interested in monitoring their performance, you can check out NannyML. We're on GitHub, totally open source, and we will be open source forever: our core algorithms will always be open source, our research will be open source. So, yeah, it would be great to have anybody in the community that finds this interesting. That's awesome. Thank you so much, Hakim, for coming on DataFramed. Yeah, no problem, thanks Adel, thanks for having me. You've been listening to DataFramed, a podcast by DataCamp. Keep connected with us by subscribing to the show in your favorite podcast player. Please give us a rating, leave a comment and share episodes you love. That helps us keep delivering insights into all things data. Thanks for listening. Until next time.
