DataFramed

Episode · 5 years ago

#7 Data Science at Airbnb

ABOUT THIS EPISODE

Airbnb's business depends on data science. In this episode, Hugo speaks with Robert Chang, data scientist at Airbnb and previously at Twitter. We'll be chatting about the different types of roles data science can play in digital businesses such as Airbnb and Twitter, how companies at different stages of development actually require divergent types of data science to be done, along with the different models for how data scientists are placed within companies, from the centralized model to the embedded to the hybrid: can you guess which is Robert's favourite? This is a hands-on, practical look at how data science works at Airbnb and digital businesses in general.

In this episode of DataFramed, a DataCamp podcast, I'll be speaking with Robert Chang, data scientist at Airbnb and previously at Twitter. We'll be chatting about the different types of roles data science can play in digital businesses such as Airbnb and Twitter, how companies at different stages of development actually require divergent types of data science to be done, along with the different models for how data scientists are placed within companies, from the centralized model to the embedded to the hybrid. Can you guess which is Robert's favourite? This is a hands-on, practical look at how data science works at Airbnb and digital businesses in general. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed. Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems it can solve. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at @DataCamp. You can find all our episodes and show notes at datacamp.com/community/podcast. Robert, welcome to DataFramed. Thank you for having me, super excited to be here. I'm super excited to have you on the show to discuss all types of things about data science, but in particular data science in tech and data science at Airbnb, where you are now, and Twitter, where you worked previously. But before we get there, I'd like to discuss you. What are you known for in the data science world? That's a good question. I think I'm just like every other data scientist, but when I was at Twitter I wrote a blog post on Medium called "Doing Data Science at Twitter", and that was a reflection on my experience of working as a data scientist at Twitter for two years. I just thought that I had learned a ton about data science and I was...

...really grateful for learning all this stuff and I wanted to share it. Originally I wrote it as a reflection for myself, and so I didn't really want to share it, but at the end of the day I thought that a lot of the stuff there could be very relevant for other people, so I shared it. Very serendipitously, it got picked up by Pete Skomoroch, who is a former data scientist at LinkedIn and kind of an influencer on Twitter, and so he shared my blog post, and I guess a couple of other people shared it too, and somehow the post went viral. That kind of got my blogging career going, and since then I've written a few other posts, all related to data science. So I don't think I'm famous, but many people have told me that they found my blog posts helpful. Well, I've definitely found your blog posts very helpful. More recently you've written one two years on, reflecting on your time at Airbnb, which we'll get to. But I do think your blog posts have opened the eyes of a lot of people working in data science, or working as data analysts or data engineers, or even people from other disciplines, to the daily ins and outs of what it means to be a modern data scientist, and that's one of the really exciting things about having you on this show. Great, yeah, I agree, it's really important. I mean, when I started writing the blog post I didn't have any specific goals in mind, but over time I realized just how useful it was. I thought I was just stating the obvious, but a lot of people gave me the feedback that they found the blog post very helpful because it helped them demystify what data science is really about. In an environment where, you know, everyone talks about modeling, machine learning, deep learning, it's really hard to understand what data science really is, especially in tech, which is the subject of our...

...discussion today, and so I think just sharing that information with people is a very useful thing for the community. Yeah, so before we get into a detailed discussion about data science in tech, I'd like to know a bit about how you actually got into data science initially. Sure. When I was in school, data science was not really a thing, it was not really a profession. I studied operations research as an undergrad and then later I studied statistics in grad school, and generally and broadly speaking, I have always been very interested in statistics, computation and also data visualization. I was very deliberate, even early on, about how I wanted to build my skill set, and so I actually very much enjoyed planning all my courses and curriculum, and I started taking a lot of classes in those three areas. Then one summer I came across a blog post by Nathan Yau, who is the author of the very great blog FlowingData, and in that blog post he cited, I think, this Harvard Business Review article on, oh, data scientist, you know, the sexiest job of the 21st century, or something like that, something really eye-catching and even click-baiting, and so I just clicked into it and started reading. I read through the description and I realized that this particular profession or role he was describing was very similar to the skill sets I had been building, and so I thought that maybe there were interesting industry opportunities that would allow me to continue to...

...work in that domain after school. And so that's kind of how I got started. And so these three specific disciplines or skill sets were statistics, computation and visualization. That's correct. And what type of research were you thinking about? What type of data sets were you looking at or visualizing or doing computation on? What interested you at that point? At that point, to be honest, I don't think it was anything specific. There were many areas that I was interested in, but I was quite interested in using open data to investigate some of the more social issues. I didn't do much public work when I was in school, but I was very interested in basically taking governmental data that is buried in the bureaucracy, digging it out, and trying to visualize basic statistics in a way that paints a picture about how our society works. So I would say that's probably the area I was most interested in when I was in school. That's probably one of the reasons why I later joined a social network company, just because there's so much data being generated by people and it really gives you a glimpse into how people share information, how they use information. That's what I was interested in. Great, and so essentially you were engaged in these practices of statistics, computation and visualization and then noticed that these, among others, had been rebranded or remarketed as data science in some sense. Yeah, oh yeah, I should bring up one extra example.

I think when I was taking a visualization class, one of the examples, and there are many examples, but one of the examples that really came to mind was the Enron scandal. Somehow they collected the emails, and someone was ingenious enough to basically parse out all the email communications, put them into a graph, and show how people had been communicating up until the scandal broke. To me it was just so fascinating to see how people leave these digital traces. It's another way of capturing history that is otherwise not possible. I've actually gone through a bunch of those emails, because they are public, and there are all types of wacky office romances, all types of really wacky stuff in there. And a couple of my old friends are digital artists, and they actually did a project with the Enron email set where you can sign up to their mailing list to receive these emails in perpetuity for the rest of, you know, the next thousand years or whatever it is. That's great. So my last question about your history: when you say computation, which is such an important part of data science today, we're not talking about pen and paper, we're talking about using a computer. What languages were you using back in the day? Yeah, that's a great question, and I have a lot of thoughts to share on this very topic. But to answer your question, when I was an undergrad I actually really hated anything computer related. When I was an undergrad I did pure math, so I did a lot of proofs, I used a lot of pen and pencil, and I tried to avoid doing anything computation related as much as I could. I think I learned MATLAB when I was an undergrad but I didn't really use it, and it was not until grad school, I...

...think. There was one class, I think it was a stochastic programming and optimization course, where we were learning some probabilistic models, and I think it was the Metropolis-Hastings algorithm. I still remember it, because that was kind of a turning point. We were asked to do some simulation and I was just, I guess, stuck; I didn't know what to do, and that's when I really started to realize there was actually a very huge gap in my education. We are moving into a world where there's just a plethora of data, so much data that can be leveraged, but you can't do it by hand. You have to really leverage computers to do the processing, and that's when I started to realize, okay, I really need to get better at data manipulation and computation in general. So I picked up R to start with, and over time I moved to Python. Nowadays I've been using a combination of R, Python and Unix, basically whichever tools are relevant to solve the problem at hand. So yeah, it's definitely a really important area if you want to become a data scientist: learning how to deal with data and how to do computation, in addition to statistics, is a very important skill to have. And you found R was the way you entered into it? Yeah, R was my first language. R to me was kind of an interesting language. It was the language that taught me to program, and if you talk to other people you'll probably hear similar stories: for a lot of statisticians and, I guess, scientists, their introductory programming language is not really a general-purpose programming language, but more a domain-specific one, usually for scientific computation, like R or MATLAB.
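
A minimal sketch of the Metropolis-Hastings algorithm mentioned above, in Python rather than the MATLAB or R of the coursework described; the standard normal target and random-walk proposal are illustrative choices only, not the original assignment.

    import numpy as np

    def metropolis_hastings(log_target, n_samples=10_000, x0=0.0, step=1.0, seed=42):
        """Random-walk Metropolis-Hastings sampler for a one-dimensional target density."""
        rng = np.random.default_rng(seed)
        samples = np.empty(n_samples)
        x = x0
        log_p = log_target(x)
        for i in range(n_samples):
            # Propose a move from a symmetric Gaussian random walk.
            x_new = x + rng.normal(scale=step)
            log_p_new = log_target(x_new)
            # Accept with probability min(1, p(x_new) / p(x)).
            if np.log(rng.uniform()) < log_p_new - log_p:
                x, log_p = x_new, log_p_new
            samples[i] = x
        return samples

    # Example: sample from a standard normal via its unnormalized log-density.
    draws = metropolis_hastings(lambda x: -0.5 * x**2)
    print(draws.mean(), draws.std())  # should land close to 0 and 1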

Over time I basically learned more about computer science in general, just on my own, to pick up other things. Yeah, my trajectory is actually quite similar, and we're just about to get into a conversation about data science in tech, but I just wanted to take a slight detour and talk about the fact that people used to do this stuff by hand. Right? I mean Pascal, Fermat; I think it was Jacob Bernoulli who came up with the binomial distribution and the Bernoulli distribution, and the story is he sat around and flipped a coin six thousand times, wrote down and sorted the whole distribution, and then figured out the closed-form solution to it. Reverend Bayes did the same, actually. But we do have such incredible computational power these days, and thank goodness, because with all the data that's coming through, we couldn't do any of that by hand these days. Exactly. And you see a lot of these patterns emerge or re-emerge in the statistical and even the computer science world. In statistics, the bootstrap has been around for a long time, but its real power couldn't be realized until there was enough computational power for people to run these resampling schemes. And then you have the whole wave of neural networks, which were invented as early as, I think, the sixties or seventies; it's not a new idea, but the computers just weren't able to do it. Nowadays it's growing really fast because of all the GPUs and all that stuff, and so the interplay of statistics and computation, I think, is going to be increasingly important. So let's jump in. From your experience at Twitter and your current work at Airbnb, I'm going to ask a pretty general question: I'm just wondering what, to your mind, are the biggest challenges currently facing digital businesses? This again relates to what we discussed earlier, which is computation specifically. I think, as a startup, you know,...

...every business, any digital business, if you want to eventually scale, you have to start from somewhere, and it's actually very important to be able to build a data foundation for you to do any sort of analytics. And so by data foundation I mean, you know, setting up the data infrastructure. That might involve things like, oh, making choices about what database we are going to use. Are we going to put things on Amazon, on AWS? Do we want to use HDFS, all that kind of stuff, which I'm less familiar with, but building a strong data infrastructure is very important. Then, second, once you have the infrastructure, it's important to build your data warehouse on top of it, and this involves basically taking your raw data, organizing it, aggregating it, cleaning it and then putting it in a form that is analytics-friendly, so that other downstream users can use it. And so data infrastructure and data warehousing, I think, are really the bedrock of all kinds of analytics, and only after that can you move on to do fancier stuff like business intelligence, experimentation or building data products. So the data foundation, I think, is a very big challenge for many different companies, and it is something that everyone has to go through, and it's very important. At a business like Airbnb, the data foundation would be incredibly important in terms of storing information about properties, users, all of these types of things. Right, yeah, exactly. And this work may sound trivial, but it's actually not, and a lot of problems will pose as trivial problems but actually aren't. I think it was Monica Rogati, who was also a former data scientist at LinkedIn, who had a famous tweet saying that basically the majority of data science is just counting, but counting in a smart way, and I feel like building a data foundation is very much like that.
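
A toy illustration of that "counting in a smart way": raw events are cleaned and aggregated into an analytics-friendly summary table that downstream users can query. The schema, the rows and the use of SQLite as a stand-in for a real warehouse are all hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical raw events table, as data might land in the warehouse.
        CREATE TABLE raw_events (event_time TEXT, listing_id INTEGER, event_type TEXT);
        INSERT INTO raw_events VALUES
            ('2018-01-01 09:12:00', 101, 'booking'),
            ('2018-01-01 17:40:00', 102, 'booking'),
            ('2018-01-02 11:05:00', 101, 'cancellation'),
            (NULL,                  103, 'booking');

        -- Clean out malformed rows and aggregate into a daily, analytics-friendly table.
        CREATE TABLE daily_events AS
        SELECT DATE(event_time)           AS event_date,
               event_type,
               COUNT(*)                   AS n_events,
               COUNT(DISTINCT listing_id) AS n_listings
        FROM raw_events
        WHERE event_time IS NOT NULL
        GROUP BY 1, 2;
    """)

    for row in conn.execute("SELECT * FROM daily_events ORDER BY event_date"):
        print(row)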

In addition to the data foundation, or, as you say, once you have the data foundation, if you look at the trajectory of a company, once you have product-market fit, that's where you really want to start growing your company. And so helping your company to grow, drive adoption and find more users who love and enjoy your services is really a big part of what data scientists typically do to help companies scale. That's something I worked on very closely when I was at Twitter, and if you look at all the top companies nowadays, there's inevitably a growth team focusing on driving sustainable growth. Beyond growth, once you really have the scale, that's when you want to start optimizing your business, and that's really where machine learning, data products and personalization can help. They can make the product smarter, they can make it more personalized and they can make the business more efficient. So I'm laying out the challenges, I guess, in a way that is aligned with the trajectory of how a company would grow itself in the data space. I think that's a really intuitive approach, and it leads to a number of questions. We've discussed the data foundation aspect briefly. I'm interested in growth, and what types of data science questions or techniques arise when thinking about these growth challenges. There are various ways to answer that. One is that I think growth is very much data driven, because a lot of the growth work is, well, there is kind of good growth work and bad growth work. The bad kind of growth work is the so-called growth hacking, where you're very short-term thinking. You're just trying to provide short-term incentives for people to sign up for your services, and then they come and they use it for a little bit, but then they don't...

...really retain. Those are not, in my opinion, the good kind of growth problems to think about. I think with the good kind of growth you really think about the business, you really think about what values your service is able to provide, and then you think about what product changes you can make to expedite the process by which your users can start enjoying the value that your service provides. These are typically more long-term thinking. But I should say that, whether it is short-term growth hacking or long-term growth work, experimentation is a very big part of doing growth, because you can't really verify whether your hypotheses are correct unless you run a randomized controlled experiment. And so when I was on growth, that was the period where I learned a ton about online experimentation. That's an area that's very important. There are also a lot of other techniques that Facebook uses, that both Facebook and Twitter use, that I learned a lot from. I think this was probably invented at Facebook: there's a framework called growth accounting, where you break down your active users into different segments of users, so new users, resurrected users, users who churned and users who are retained. And if you break down your active users by these segments, you can observe over time, oh, are you actually increasing the share of new users, or do you have a leaky-bucket problem where your share of churned users is increasing? That's a very informative view telling you what the composition of the users using your services is.
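
A minimal sketch of that growth-accounting breakdown, classifying users as new, retained, resurrected or churned by comparing activity across periods; the user IDs are made up for illustration.

    # Active user IDs in two consecutive periods (illustrative data).
    last_week = {1, 2, 3, 4}
    this_week = {3, 4, 5, 6}
    ever_before = {1, 2, 3, 4, 5, 7}  # anyone active in any earlier period

    new = this_week - ever_before                # first time ever active
    retained = this_week & last_week             # active in both periods
    resurrected = (this_week - last_week) - new  # came back after a gap
    churned = last_week - this_week              # active last period, gone now

    print(f"new={new}, retained={retained}, resurrected={resurrected}, churned={churned}")
    # Net change in active users this period = new + resurrected - churned.
    print("net growth:", len(new) + len(resurrected) - len(churned))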

So could you give me an example of an experiment that you ran at Twitter that you found interesting or exciting? That's a great question. I ran so many experiments at Twitter. I can talk about the first project that I was involved in when I first joined Twitter. It was a project that is less growth related, to be honest. So let me describe what it is. Basically, as a social network, people get notifications and emails to get them to interact with our platform, and we had a hypothesis that a lot of our users were receiving all kinds of notifications and getting fatigued by these messages, and some of them are more useful than others. So one of the things we were trying to figure out is: is there a way we can identify a subset of emails or push notifications which users don't really need to get, because they have low value? And we had a hypothesis that if we were able to identify that combination of email and user set, then we could remove those emails. That would, one, help those users have a cleaner inbox, and two, it would still not hurt our bottom line; people would still find value and still come back. There was some data science work that went into figuring out what the right combination was. And then we ran an A/B test where we had a control group whose users would continue to receive whatever notifications we have in the system, and a treatment group where we held back certain emails from the users so that we respect the users' inboxes. The end result for us was to compare the engagement metrics between the control group and the treatment group and see whether we were able to maintain the same level of engagement for users who did not receive those notifications.
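
A sketch of the kind of control-versus-treatment comparison described here: a two-sided two-proportion z-test on an engagement rate. The counts are simulated, not Twitter's numbers, and in practice a non-inferiority test across several engagement metrics would be closer to the real analysis.

    import numpy as np
    from scipy import stats

    # Simulated outcomes: users who came back (engaged) out of those assigned to each group.
    control_engaged, control_n = 41_800, 100_000      # received every notification
    treatment_engaged, treatment_n = 41_650, 100_000  # redundant notifications held back

    p1, p2 = control_engaged / control_n, treatment_engaged / treatment_n
    p_pool = (control_engaged + treatment_engaged) / (control_n + treatment_n)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))
    z = (p1 - p2) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided

    print(f"control={p1:.4f}, treatment={p2:.4f}, z={z:.2f}, p={p_value:.3f}")
    # A large p-value here is consistent with the held-back notifications not
    # hurting engagement, though on its own it does not prove the absence of an effect.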

And luckily, it turns out that we did the analytics right, so we realized that basically the users who are most active are receiving, for example, if I get a direct message or a tweet from you, it's very likely that I will get the same notification as a push notification as well as in my email inbox. So those are in some sense redundant, and if we can just remove one or the other, that will make their inbox less cluttered, but at the same time they still get the message from the other channel. So that's one example where it wasn't really used to drive growth, but it was using an A/B test to respect our users' space and inboxes. That's a wonderful example, and it's a great experiment, but one of the reasons I think it's so wonderful is that it's something we can all relate to, right, with the world of push notifications coming in. I mean, while I'm recording this podcast with you, I've got my phone on airplane mode for the obvious reason. Up next we have a segment called Data Science Best Practices with Justin Bois. Justin's a lecturer at Caltech, where he teaches lots of courses for biologists and bioengineers and helps them to work with their data. Hey Justin, great to be with you, Hugo. I understand you want to talk to us today about how you share data? Yeah, I want to make a simple but important point: if you want to share a data set, make an informative plot that displays as much of your data as possible. I totally agree with that, Justin, but that can't be the whole story. You're right, it often is not the whole story, and there can be lots of statistical modeling and inference to be done, but the seemingly obvious notion that you should generate a plot that displays as much of your data as you can is often overlooked. All too often I see research papers, figures and news articles and, when I was working in industry, internal memos that do not do this. Really? What do they do instead? Most commonly, people report a summary statistic like the mean or median, and sometimes a confidence interval. This is often done in text, not with graphical plots, and sometimes the...

...confidence interval is not even properly described. There are often implicit assumptions of normality of the data and symmetric confidence intervals, that sort of thing. What's wrong with that? Well, there's nothing wrong with reporting summary statistics, provided you clearly define the statistical model and assumptions under which they are calculated. But when you don't plot the data you can end up over-distilling the data to just a few numbers. An effective graphic tells so much more of the story. In a graphic you can explicitly see what the measurements are and, importantly, how they might be distributed. Can you give me an example of a good display of data? Sure, one of my favorites was used to show how dominant Steph Curry's 2015-2016 basketball season was in terms of three-point shooting. This graphic was made by The Upshot for the New York Times. We'll link to it in the show notes, or you can find it by googling "Steph Curry Upshot". Anyway, the people at The Upshot collated all of the NBA players who were in the top twenty in three-point attempts in each season since 1980. They could have computed the mean number of made threes over all players and the standard deviation, and then said that Curry's season was three or four standard deviations above the mean. That distills it down but does not really show the whole story. Rather, the folks at The Upshot plotted the cumulative number of threes made for all of these players during their respective seasons. This graphic is so much richer. It shows how the number of threes made by league leaders has been steadily increasing over time, and it effectively shows how completely anomalous Curry's season was. Yeah, that's a great graphic, but it takes up so much more space than reporting a summary statistic. Well, that is true, but your data are important. Show them some respect and give them space in your reports. I also think space constraints, at least for graphics versus text, are irrelevant. Reports, research papers, even news articles are distributed digitally these days.

Space is essentially free. Beyond that, if you want to make a point with your data, it is worth it to make a plot. If you don't have the space for that, you might ask yourself how important the inquiry for which you acquired the data actually is. So then, as a rule, plot your data. Yes, it's a simple but effective rule: give your data the space they deserve. This has been a lot of fun, Justin, let's chat again soon. You bet. I talked today about the importance of just plotting your data. In future segments I hope to give some examples of effective plotting techniques, such as beeswarm plots and empirical cumulative distribution functions. I'll see you then. I'm looking forward to it.
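
A minimal sketch of one of the plotting techniques Justin mentions, an empirical cumulative distribution function, which shows every measurement rather than a single summary statistic; the data are simulated purely for illustration.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    measurements = rng.gamma(shape=2.0, scale=1.5, size=150)  # illustrative data

    # ECDF: sort the data and plot each point against its cumulative fraction.
    x = np.sort(measurements)
    y = np.arange(1, len(x) + 1) / len(x)

    plt.plot(x, y, marker=".", linestyle="none")
    plt.xlabel("measurement")
    plt.ylabel("ECDF")
    plt.title("All of the data, not just a summary statistic")
    plt.show()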

After that interlude, it's time to jump back into our chat with Robert. We talked about the three different phases: data foundation, growth and machine learning. Where is Airbnb at now? That's a great question. The answer is: it depends. The business is organized into different product verticals, and some verticals are very mature and some verticals are quite new. To give an example, the Homes business, which is how Airbnb got started, is very mature by now. And then the Trips business, a new product vertical that was launched just last year, where, in addition to finding homes, you can find local guides who can help you enjoy local experiences; that particular business is quite new. And so, depending on the phase the product is in, the focus on data foundation versus growth versus machine learning might be different. For Homes, the data foundation has been built, and it's still being built, but the data foundation is by and large there, we're continuing to grow the business, and there are a lot of machine learning and personalization data products being worked on for the Homes business. Whereas if you look at Trips, because it's a new product, a very big focus is on growth. So it really depends on which product vertical we're talking about. One thing that changed in your professional life when you moved from Twitter to Airbnb was a far stronger focus on machine learning, right? That's right. And what type of machine learning are you doing or involved in at Airbnb? To give you a bit more context, when I was at Twitter I was very much involved in product analytics, experimentation and statistical inference, and so when I transitioned to Airbnb, I made a pretty conscious decision to learn about other areas of data science. I wanted to do more machine learning and more statistical modeling, and I was very lucky that in the past, I would say, two years at Airbnb I was able to be involved in some meaty modeling projects. The one I have been working on recently, earlier this year, is a project where we try to model the lifetime value of a listing when it first gets booked on Airbnb. And what does LTV stand for? Can you talk us through it? Yeah, so lifetime value: you can think of it as, you know, if you have a listing and you put it on Airbnb, you will hopefully start getting bookings, and each time you host someone you will get paid, and if you accumulate these payouts, then over time that is the total sum of money that you're able to earn by participating in this marketplace. And so if we look far enough ahead, we will be able to calculate the lifetime value of your listing on Airbnb. And so...

...that's what LTV, lifetime value, stands for. As for why it's important for us: it's very important because, in addition to having very strong organic growth, we're also trying to grow our business in other ways, and sometimes growing our business involves costs. So what we want to do is to be able to optimize our growth acquisition channels, our supply acquisition channels, and what we want to be able to do is calculate, for each channel, the return on investment. If we put in one dollar to acquire listings from channel A versus channel B, how much money will we get back in return? In theory, we could just wait for a year, or wait for a long period of time, and then we would really get to observe how much money each listing is able to make, and then we could calculate the return on investment. But from a business standpoint, in practice, we want to be able to make that decision of optimizing across different channels as early as possible, because we want to be forward looking, and this is where a predictive model for the listing lifetime value can be very, very handy. By taking into account a bunch of signals, like the location of the listing, its availability, what kind of amenities it has, what its historical performance is and all that stuff, we can build a historical data set, use that training data to train a model, and then predict, for future listings that onboard with similar characteristics, what their likely returns are. When we have those predictions, we can start making these important decisions about optimizing across channels a lot more easily than by just waiting for a year.
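
A minimal sketch of the kind of supervised LTV model described: fit a regressor on historical listings with observed payouts, then score newly onboarded listings so channel ROI can be estimated early. The feature names, the synthetic data and the choice of a gradient-boosted regressor are illustrative assumptions, not Airbnb's actual model.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(7)
    n = 5_000

    # Hypothetical historical listings with their observed first-year payout.
    listings = pd.DataFrame({
        "nightly_price": rng.uniform(40, 400, n),
        "availability_days": rng.integers(30, 365, n),
        "n_amenities": rng.integers(0, 20, n),
        "is_urban": rng.integers(0, 2, n),
    })
    listings["observed_ltv"] = (
        0.03 * listings["nightly_price"] * listings["availability_days"]
        * (0.5 + 0.5 * listings["is_urban"])
        + rng.normal(0, 300, n)
    )

    X = listings.drop(columns="observed_ltv")
    y = listings["observed_ltv"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = GradientBoostingRegressor().fit(X_train, y_train)
    print("R^2 on held-out listings:", round(model.score(X_test, y_test), 3))

    # Score a few held-out rows as a stand-in for newly onboarded listings.
    print(model.predict(X_test.head(3)))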

That's a really good example of a machine learning challenge or task feeding right back into business development. Right, that's correct. And so that kind of leads me to my next question, which is: data science, especially for companies such as Airbnb, where there is so much data, can have such an influence on any decision anybody within the company makes. So I'm just wondering how data science is integrated into the business as a whole. Yeah, that's also a great question. In tech and Silicon Valley, companies typically organize their data science teams in different ways, and as far as I can tell there are generally three models. These might not be the only three, but these are the three broad categories or approaches that I've seen for how people organize data science teams, or how they integrate them into the business. The first one is the centralized model, where you have a team of data scientists basically sitting together, working very closely, trying to solve various challenging problems, and then once they solve them, they will try to push them, try to sell them to the rest of the organization to adopt. So it's a very centralized approach where the data scientists are really taking command and driving the roadmaps for how they will provide value for the organization. That's one extreme. The other extreme is what we call the embedded model, embedded in the sense that you have a distributed data science team where data scientists are embedded with the product teams. So, just like a full-stack product team, where you would have perhaps a designer and a team of engineers and engineering managers, you would also staff a data scientist on the team, and the data scientist on that particular product team is very much responsible for all the data-related work for that product. It could be doing opportunity sizing, it could be doing product analytics,...

...understanding how the users use the product. It could be designing experiments to help iterate on the product. It could be building machine learning models for the product. It could be a wide variety of things, and so these are people who are very much embedded in the product, and they know the domain of that product very well, but not necessarily the other product verticals. So one extreme is the fully centralized model, the other extreme is the fully embedded model, and each has its pros and cons, which we can talk about a little bit more. But the last model is kind of a hybrid of the two, where you have data scientists continuing to be embedded with product teams, but you still have a functional team, so data scientists do not report to a product manager or an engineering manager; they report to a data science manager. That way you have data scientists all contributing to specific product areas, and they're really the product owners and experts in those specific product areas, but at the same time they are able to enjoy the right level of guidance and career growth from their immediate functional manager, and that's called the hybrid model. So the answer to my next question may be business specific, but in your experience, which is the most successful, both on the business side and for the professional development of data scientists? Speaking from the experience of someone who has experienced all three, I actually think the hybrid model is the best. The issue with the centralized model is that a lot of the time, as a centralized organization, despite doing very good work, you really have to push your work onto other people. By people here I usually mean the product teams, and you can imagine that being very challenging, because a lot of the time, especially in technology companies, they go through, you know, either sprint planning or cycle...

...planning, where they are constantly deciding what to build, and that's a very rigorous process that requires a lot of attention and a lot of planning. So if you're not part of the team and you're trying to push your insights, or even ideas or agendas, for the other team to adopt, unless you have the right level of organizational relationship, it's really hard for your vision to be realized within the product, and so I find that generally very challenging if you're on a centralized team. On the other hand, if you're at a company with an embedded model, I think that's the best way for your data scientists to provide immediate impact, but the downside is that there can be times when data scientists do not report to a data science manager. They report mostly to, say, an engineering manager, in which case the engineering manager might not know how data science works, because it's such a new field, unlike engineering. So to me, that end of the spectrum, the embedded model, really maximizes the utility of the data scientists, but sometimes at the expense of their career growth. I think that's why the hybrid model was created, to balance the two, business impact and career growth, and I personally really enjoy the hybrid model. But also, as you say, it balances the trade-off between having a manager who can speak the language of data science with you, and feeling like you have ownership over the decisions being made with respect to the work that you do. Exactly, I think that's quite accurate. We all know that digital businesses are amassing lots and lots of data these days, and for the past several years there's been this buzz term "big data" flying around. I was just wondering what your take on big data is and what role it plays for digital businesses. Yeah, big data is really a buzzword, just like, I guess, data science and deep learning and all these terms,...

...and so it's really hard to characterize what big data really means. In my opinion, and I'm sure a lot of people might disagree with me, but from the perspective of a data scientist or a statistician or a scientist, I think big data really means that when you run into problems where you're doing analysis or building a model, the data that goes into your work doesn't fit on a laptop. I feel like that's generally a pretty good test for judging whether you're dealing with big data or not. Or, to put it another way, there are actually many, many business problems which can be solved without using big data. And, you know, this might be a digression, but if you consider a lot of the top Silicon Valley companies nowadays, people will build internal R packages or Python packages where a lot of the work goes into making the R and Python environments connect to the internal data warehouse, so that you can just query data from the warehouse, pull it down to your local machine and then do your analysis. And I would say a lot of the actual data science work can actually be done on your laptop. So I think big data is not the solution to all the problems that people encounter when doing data science.
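
A minimal sketch of that pattern: query an analytics table from the warehouse and pull only the small result set down to your laptop for ordinary local analysis. SQLite stands in here for the connection an internal R or Python package would provide, and the table and columns are hypothetical.

    import sqlite3
    import pandas as pd

    # Stand-in for a warehouse connection; real internal packages would hide this setup.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE daily_events (event_date TEXT, event_type TEXT, n_events INTEGER);
        INSERT INTO daily_events VALUES
            ('2018-01-01', 'booking', 4210),
            ('2018-01-02', 'booking', 3980),
            ('2018-01-02', 'cancellation', 310);
    """)

    # The warehouse does the heavy lifting; only the aggregated result comes down locally.
    df = pd.read_sql_query(
        "SELECT event_date, SUM(n_events) AS n_bookings "
        "FROM daily_events WHERE event_type = 'booking' GROUP BY event_date",
        conn,
    )
    print(df)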

That said, I do think there are many problems that really do require big data, that really require you to go above and beyond building your model on your laptop. And so, again taking the example I mentioned earlier of building a listing-level lifetime value model: we have so many listings on the platform, and we're trying to make predictions for all these listings, taking into account a bunch of signals that we have. The scale of that problem is really, really big, and it's not something that you can just easily crank out, or some model that you can easily train on your local machine. And so that's really where you get into big data territory, where you need to start really thinking at scale: what is the best way for you to build this thing beyond your laptop? And this is where having a really strong data foundation team or a very strong machine learning infrastructure team can be really helpful, because those are the people who can really help us operationalize these big models. Exactly, and something you've spoken to there, which I really like, is the fact that these types of businesses have developed infrastructure which allows you, on your end, to act like you're working with smaller data, and a lot of the work with big data is done on the infrastructure side. Let's now jump into a segment called Data Science Blog Post of the Week with DataCamp curriculum lead Spencer Boucher. What post did you enjoy checking out this week, Spencer? This week I wanted to talk about a great blog post that hits on the relationship between data science and academia. I'll bet that a big chunk of the listenership of DataFramed either are or have been involved in academic research, just because academia has played a crucial role in the development of data science from its earliest days. But the best way for data science and academia to work together hasn't always been super clear. Jake VanderPlas has summed up some of these issues really well in his blog post "The Big Data Brain Drain: Why Science is in Trouble". Even though he wrote it way back in 2013, it's definitely still relevant today, so we'll definitely link to this post in the show notes. So what is one of the major issues that Jake identifies? Well, Jake correctly points out that as science becomes more...

...and more data driven every year, the ability to effectively process data is superseding other, more traditional research skill sets. More and more, scientists must be broadly trained experts in statistics, computing, algorithms and software. Sometimes it can begin to feel like domain knowledge is merely an afterthought. Yet despite this shift, academia has actually been pretty slow to adopt incentives that encourage developing the data science aspects of academic research. And why is that bad? Well, it's bad because an academic culture that only rewards publications runs the risk of losing out on new, potentially game-changing tools like NumPy, SciPy and scikit-learn. Researchers need powerful open source technologies like these to advance the frontiers of their fields, so there need to be systems in place to reward time spent working on them. As an academic who's a core contributor to both scikit-learn and SciPy, Jake knows these pain points really well. Although a lot of progress has been made in the years since this blog post came out, there's still definitely a long way to go. The incentives for developing data science tools are just so much better in industry, where that skill set is rewarded with better pay and definitely higher prestige. To quote Jake: some of the most promising upcoming researchers are finding no place for themselves in the academic community, while the for-profit world of industry stands by with deep pockets and open arms. Does Jake propose any avenues for improving or solving these conditions? Yeah, Jake's got several ideas in the post about how to prevent unnecessary brain drain of academics to Silicon Valley. One that I find particularly interesting is the idea of creating a brand new academic employment track that provides a tempting career path for open source software developers. Check out his post in the show notes for more. Thanks for sharing this post with us, Spencer. If you dug that, listeners, make sure to check out next week's episode, which is a conversation with Jake VanderPlas himself. Jake's a data scientist, astronomer, renowned Pythonista and open source beast. Jake...

...and I will be chatting about data science, astronomy, the open source development world and an array of other data-related ideas. Thanks once again, Spencer. Yep, you got it, dude. Time to get straight back into our conversation with Robert Chang. With all this data that businesses such as Airbnb and Twitter have, there's a huge possibility for social research, particularly with respect to social networks. What are your views on this type of stuff? Yeah, it's actually very interesting. I find it fascinating, and, you know, I've never worked at Facebook, but I am a big admirer of their core data science team. Their core data science team has done a lot of interesting research trying to understand how information disseminates in a network, and basically how information dissemination influences people's consumption and sometimes even their decisions. And so, to make it more relevant to current events, we have heard a lot of things about, you know, fake news on Facebook, for example, and so it's really interesting to think about: when these ads are being shown on the network, how does that influence people's perceptions and how does that influence their decisions? I think there is a lot of work that has been done by the Facebook core data science team, and many other companies as well, to study these kinds of things, and to me that's very fascinating, because before these social networks existed at this tremendous scale, it was really hard for us to understand these things, and so it's really exciting to leverage the data generated by social networks to understand some of these human behaviors at scale. But, you know, if I...

...want to play the devil's advocate, I would say this area of research is not entirely uncontroversial. Right. There is a lot of interesting research that has been done that makes people feel uncomfortable. For example, I think the famous one from Facebook was, and I don't remember the details anymore, but I think it was something along the lines of: they ran an A/B test where they slightly changed, not the word choices, I'm sorry, but the sentiment of the posts being presented in the news feed, and then they tried to measure the well-being of the people over time, and then they tried to understand the implications. These are things that people can sometimes find very uncomfortable, because it's somewhat manipulative. I'm sure that when they do this research they have the best intentions, but it's very important to be careful to do it in a way that is ethical, and, you know, the way it would be done at universities. But this is a great point and a great question, because research labs historically have been held to certain standards with regard to the experiments they can run on the general populace, right? So perhaps we're even thinking of a future in which big business and social networks and this type of stuff need to be held to similar standards. But, as you say, they perhaps have the best intentions, but it might not be up to them to decide that. Right, right, right. So, as we discussed at the start of this great conversation, you've written a number of blog posts which do a lot of things. One thing they do very well is provide a history of your path into data science, and they actually provide not only advice for aspiring data scientists but reality checks for seasoned data scientists as well. So I was just wondering: what advice do you have for people aspiring to be data scientists now, in 2018?

Yeah, that's a great question, and as a sneak peek, I'm actually working on a series of new blog posts on data engineering. So, if there's any piece of advice that I would give to aspiring data scientists, I think it is to learn a little bit more about how to build data pipelines, also known as ETL jobs. The reason I mention that is because, over the course of my career so far, I have observed that the people who are able to leverage data engineering as an adjacent discipline are the people who are able to take on more ambitious and meaty projects over time, over their career path, and I think that's something that is perhaps undertaught, or not really obvious to many aspiring data scientists, because there is so much buzz around, you know, building the newest and fanciest deep learning models, and so these fundamental, foundational skills get overlooked. If you look at our current education system, whether professional or academic, I think there's generally not a very big focus on teaching people the end-to-end data science workflow, and the end-to-end workflow generally involves a lot of steps, and almost always the beginning is, you know, ingesting your data and doing a bunch of data cleaning, and only after that can you start doing analysis. And so, how do you build your workflow in a way that gives you analysis-ready data, or, as Hadley Wickham might put it, how do you get your data into a tidy data set? It takes a lot of work and a lot of time, and I think, you know, schools and even these professional boot camps generally don't teach you that.
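
A minimal sketch of the kind of recurring ETL job being recommended here: extract a day's raw records, transform them into a tidy summary table, and load the result, parameterized by run date. The paths and columns are hypothetical, and a scheduler (cron, or a workflow tool such as Airflow) would normally trigger the daily run.

    from datetime import date
    from pathlib import Path

    import pandas as pd

    def extract(run_date: date) -> pd.DataFrame:
        """Pull the raw records for one day (hypothetical CSV path)."""
        return pd.read_csv(f"raw/events_{run_date:%Y-%m-%d}.csv")

    def transform(raw: pd.DataFrame) -> pd.DataFrame:
        """Clean and aggregate into a tidy, analysis-ready table."""
        clean = raw.dropna(subset=["user_id"])
        return (
            clean.groupby("event_type", as_index=False)
                 .agg(n_events=("user_id", "size"), n_users=("user_id", "nunique"))
        )

    def load(table: pd.DataFrame, run_date: date) -> None:
        """Write the partition for this run date (hypothetical output path)."""
        table.to_csv(f"warehouse/daily_events/dt={run_date:%Y-%m-%d}.csv", index=False)

    def run(run_date: date) -> None:
        load(transform(extract(run_date)), run_date)

    if __name__ == "__main__":
        # Create a tiny example input so the pipeline can run end to end here;
        # in practice a scheduler would call run() once per day against real sources.
        today = date.today()
        Path("raw").mkdir(exist_ok=True)
        Path("warehouse/daily_events").mkdir(parents=True, exist_ok=True)
        pd.DataFrame({
            "user_id": [1, 2, 2, None],
            "event_type": ["search", "booking", "search", "search"],
        }).to_csv(f"raw/events_{today:%Y-%m-%d}.csv", index=False)
        run(today)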

They just give you a preprocessed data set, and then they focus on technique, they focus on algorithms, and you just take that data as given and start building models on top of it. But that's not really how data science works, and so I think it's very important to focus on learning that. I have a lot of thoughts on that, and I'm still trying to organize them, but I think it's very important to get exposure to building data pipelines that will compute data for you on a recurring basis. Well, that's a great sneak peek into some coming work, which I'm sure everyone's excited about. Thanks. And what else? I think it's important to get exposure early on to the different types or facets of data science work, because this field is still growing, and because this field is nascent and new, we have a labor market where you have a bunch of people who believe there is a need, and I assure you there is a need, for data talent. So there's the demand side. And you have people who are aspiring data scientists, who are trying to become professional data scientists, so there is the supply side, and you have this labor market where supply is trying to meet demand and demand is trying to meet supply, and that's a great thing. But I think the challenging thing is that sometimes companies don't necessarily know what they're hiring data scientists for, and sometimes people who aspire to become data scientists don't really know what the right skills to build are. And so, in my opinion, if you're a new data scientist, it's actually very important not to jump directly into the areas which you think are data science. Most people, especially the ones who have graduated from university programs or even boot camps, are people who really, really want to do modeling, really, really want to do machine learning, really, really want...

...to do deep learning, but really that's only a subset of what data science is about. And so my recommendation is, instead of just jumping in or concluding that modeling, machine learning or deep learning is the only path for you to become a great data scientist, I think it's good to be a little bit more patient, sit back and observe what the steps involved in building and working on an end-to-end data science project are. From there, I assure you, you will learn a lot of things that are not actually taught in school but are very, very relevant to your day-to-day work. Building ETL pipelines is one such example; experimentation and experimental design is another. I think these are all very important skills to learn, and so for someone who's early in their career, I think it's important to be patient and to learn all the basics, and after that you can decide, okay, do I want to specialize? If you do, then you can dig in a little bit more and learn more deeply. Not only is that good advice, but you actually spoke to something which I think about a lot, which is the idea of data science being what you referred to as a nascent field, something that's being born as we speak. We're going to have to wrap up in a minute, but we've been speaking about modern data science, particularly in terms of digital businesses. I'm wondering what the future of data science looks like to you in the next five, ten, twenty years: where will this field be going? Ah, that's a great question, and it's almost like a prediction question, which is exactly what you work on in machine learning, hey? Yeah, and, as you know, prediction is always hard, so it's hard to say. In my opinion, though, if I were to make one prediction, it is that the tooling around data science is evolving very, very fast, and the path of innovation is not going to stop. I think right now we're in a world...

...where sometimes even doing the simple stuff is really hard, because of the scale of the data and because of the knowledge that's required to command or use all these tools. It has improved a lot, but I think there's still a lot of room for improvement and people are working on it, and so I think in five or ten years a lot of this data cleaning and a lot of these ETL processes can potentially be greatly simplified, and I think that would be a very exciting place to be, because then you can outsource a lot of the nitty-gritty detail work to the machines. They can do it for you, and you can spend more of your time thinking about the business problem, solving the right problem and making your solution impactful for the domain that you're interested in. One specific example of this pattern is in machine learning, where there's this big movement of so-called AutoML, automatic machine learning, and you have companies like DataRobot, H2O and a few others that are all working towards this world where they are trying to automate a lot of the nitty-gritty details for you, or I should say at least they're trying to automate some part of your workflow. And so, I don't know, I think the world will be interesting when we get there, and I always have interesting debates with my coworkers on, you know, when will we be replaced by machines? It's hard to say. I don't think we will ever be replaced by them, but I do think that the tools will evolve a lot, and so I'm looking forward to the day new tools are developed and born. I'm also looking forward to that day. And so, as we move from modern data science into future data science, I'd like one final call to action...

...from you. What should we be doing? What should we be learning? My recommendation would be: focus on the fundamentals and don't stop learning. And if I had to pick one thing for all data scientists to learn, I would say learn SQL. I didn't know SQL until I started working, so it's entirely possible to learn SQL, and you will learn how important it is for all kinds of data science work. Thanks, Robert, it's been an absolute pleasure having you on the show. Thank you so much for having me. Thanks for joining our conversation with Robert Chang about the role of data science in business, at Airbnb and in tech companies at large. Robert told us about the different types of data science work required at multiple stages of company development, from laying the data foundation, to data science for growth, to machine learning for making the product or products smarter. We saw that at Airbnb right now there are even several products at different stages of development. We took a dive into concrete examples of the data science Robert does, such as estimating the lifetime value of properties on Airbnb, and Robert also broke down for us the different models for how data scientists are placed within companies. Robert made a strong case for the hybrid model being his favorite. Make sure to check out our next episode, a conversation with Jake VanderPlas, a data scientist, astronomer, open source beast and renowned Pythonista, who will join me to chat about data science, astronomy, the open source development world and the importance of interdisciplinary conversations to data science. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at @DataCamp. You can find all our episodes and show notes at datacamp.com/community/podcast.
