DataFramed

Episode · 4 years ago

#10 Data Science, the Environment and MOOCs

ABOUT THIS EPISODE

Air pollution, the environment and data science: where do these intersect? Find out in this episode of DataFramed, in which Hugo speaks with Roger Peng, Professor in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, co-director of the Johns Hopkins Data Science Lab and co-founder of the Johns Hopkins Data Science Specialization. Join our discussion about data science, its role in researching the environment and air pollution, massive open online courses for democratizing data science and much more.

In this episode of DataFramed, the DataCamp podcast, I'll be speaking with Roger Peng, professor in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health and co-founder of the Johns Hopkins Data Science Specialization. Roger is also a well-seasoned podcaster on Not So Standard Deviations and The Effort Report. Today we'll talk about data science, its role in researching the environment and air pollution, massive open online courses for democratizing data science and much more. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed.

Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems it can solve. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at @DataCamp. You can find all our episodes and show notes at datacamp.com/community/podcast.

Hey Roger, and welcome to DataFramed. Thanks for having me. Such a pleasure to have you on the show, and I'm really excited to have you here to talk about data science, the environment, massive open online courses and the R ecosystem. But before all of that, I'd like to find out a bit about you. What are you known for in the data science community? This is a horrible question, right? Absolutely, that's why I opened with it. I feel like I've been doing this for so long that different segments of the community know me for various things, so it's hard to pin it down to one or two, but let's go through them. Most people probably know me because I work at Johns Hopkins. I'm a professor of biostatistics there; I've been there for about twelve years now, doing statistics research and teaching. More recently, I think people know me because I've been in R for a long time. I've been using R for almost twenty years now. I've been a user, I've been a developer...

I used to contribute a lot back when they had mailing lists. I mean, I think they still do, but they're not as active. And now I teach these massive open online courses in data science, which are heavily focused on R, and I think a lot of people have been exposed to me in that manner, watching all of my videos and such. Then, even more recently, I've been doing some podcasts. I run a podcast with Hilary Parker, who's at Stitch Fix, called Not So Standard Deviations, and I run a blog called Simply Statistics with Jeff Leek and Rafael Irizarry. So, depending on where you entered my life, you might know me for one of those three things. And I think the first time we spoke, I told you that I actually took your MOOC in the early months of January 2013, before the specialization actually existed. It was my first interaction with R. Yeah, I think you got version 0.1 of that. It was incredible. Well, and this many years later, I'm at DataCamp. It's always very gratifying to hear people say that. It's exactly what we want: as many people as possible exposed to R and learning data science. So it's great to hear. So how did you get into data science initially? Oh, here you're going back quite a bit. I majored in math at university, and, to be honest, I was not destined to be a mathematician. But as part of the math major you had to take some statistics and probability courses, and I really enjoyed those. I can't say I did very well in them, but I really enjoyed them, and so I decided I was going to pursue that. I briefly thought about being a software engineer after college, but decided I didn't want to do that. So I thought I'd go to graduate school and do a PhD in statistics, and I went to UCLA to do my PhD, and that...

...department, at the time, was a little unusual: a very applied, very data-oriented, very data-analysis-oriented department. It had a huge influence on me, in terms of my education and my philosophy, and that's the root of it all. I think I've always enjoyed data science and data analysis, from the get-go. So that's where things got started. Yeah, and it sounds like during that time you were doing data science; the term just didn't necessarily exist. Well, there are those who would argue about whether data science is just all those things. But nevertheless, I was very influenced by the people I learned from at UCLA, who had this strong sense that you need to take ownership of a whole problem: to think about the questions being asked, how the data is collected, and to understand the entire process by which data is analyzed. I think that has evolved now into this huge area of data science and machine learning. I love that you say you weren't destined to be a mathematician, because I went to grad school in pure math and I feel exactly the same about myself these days. Yeah, I mean, I took real analysis three times. I think I got it on the third time. How many times did you take complex analysis? That can be for another conversation. Yeah, I think so. I'm really interested that you worked in software engineering as well, before moving into biostatistics or concurrently. What type of role do you think your software engineering skills have played in what you're interested in and how you approach your work now? Well, it played a big role in my career development, because when I first started out in statistics, the idea that it would be useful to have knowledge of software engineering was very unusual. Not many people had any training in software engineering ideas or principles or practices. And of course software engineering...

...practices have evolved quite a bit over time, but there were established best practices for developing software even in the late nineties, when I started out, and very few people in statistics had that kind of training. So it was, as you might say, a differentiating factor for me, in terms of my ability to develop software, to think about how things should be built, and to use that to build tools. Things are quite different now: the ideas of software development and software engineering permeate data science, and I think that's a good thing. If you look at things like the tidyverse in R and all these other recent developments, a lot of that is infused with the ideas of software engineering and software development, and it has changed the way we do data analysis for the better. So now you work in environmental biostatistics and you research the health effects of air pollution and climate change. What are the major challenges facing these fields? I think one of the biggest challenges, from the perspective of a statistician, is that the signal-to-noise ratio in this kind of problem is relatively low. Air pollution is generally understood to be a harmful environmental exposure, but it's not something that just knocks you over dead in the middle of the street. The connection between air pollution exposures and health outcomes is inherently weak. But because everybody is exposed, because everybody has to breathe, there's this huge population exposure, and that makes it a really important problem for everyone across the world. So the inherently weak signal in this kind of problem makes it interesting to statisticians, and what it requires, in some areas, is gathering huge data sets. In a lot of the work that I've done, we've gotten huge administrative databases for health outcomes, looked at large networks of air pollution monitors, and tried to link them together. So there's a huge, complex data management and data integration problem just to get set up to answer these kinds of questions.
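To make the weak-signal point concrete, here is a minimal R simulation. It is illustrative only, not from the episode: the effect size, baseline rate and sample sizes are made up. A small effect of a hypothetical PM2.5 exposure on a rare outcome is statistically invisible in a small study but detectable at administrative-database scale:

```r
set.seed(42)

simulate_study <- function(n, beta = 0.02) {
  pm25 <- rnorm(n, mean = 12, sd = 4)                  # hypothetical daily PM2.5, ug/m^3
  outcome <- rbinom(n, 1, plogis(-4.6 + beta * pm25))  # rare adverse outcome (~1% baseline)
  fit <- glm(outcome ~ pm25, family = binomial)
  confint.default(fit)["pm25", ]                       # Wald 95% CI for the exposure effect
}

simulate_study(n = 500)     # interval typically straddles zero: the signal is lost in noise
simulate_study(n = 500000)  # interval excludes zero: scale recovers the weak signal
```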

In terms of the end game of this type of research, do you need to deal with policy makers and legislators as well? Is part of your work, I suppose, communicating these technical results, where you've extracted the signal, in order to affect policy? Yeah, that's a big part of the endpoint of all this work. In many countries that have environmental regulation, like the United States, the policies are typically informed by the latest and best scientific evidence. In the United States, for example, the Clean Air Act, the regulation that governs air pollution standards, specifically says that the standards have to be informed by science. So a lot of the work that we do, after we publish it in the journals, gets fed into the evidence base that is developed by regulatory agencies, and we do a lot of discussion, a lot of talking to environmental agencies and public policy makers, to understand what the margin of safety should be for regulating air pollution exposure. And what type of specific questions do you look at in your research? I'm interested in two facets of air pollution research. One is outdoor air pollution, the pollution you typically see outside when it's hazy, and the other is indoor air pollution, which might be in people's homes, and that's of a very different nature. Right now I'm interested in how air pollution, which is a complex mixture of many different chemicals, can be more or less harmful depending on the nature of that mixture, because if we understand that a little better, we can understand how to intervene and how to control the sources of pollution to make it less harmful to human health. So that's one area I'm working on right now. The other area I mentioned is the indoor environment, where...

...we have vulnerable people, like children or the elderly, who are exposed to things like dust, allergens and all kinds of nasty things that can be in the home. And the question there is similar: how can we intervene to modify the home environment to reduce morbidity and improve health for people who spend time in the home, which can be a large percentage of their time in many cases? And I presume this is actually quite difficult, because a lot of the time in scientific research we want to do experiments. Right, exactly, and I think one of the challenges with outdoor air pollution is that there are very few controlled experiments you can do where you, say, modify the level of pollution and see how people respond. It just doesn't really work that way. Sometimes you get lucky: a power plant shuts down by accident, or a city hosts the Olympics and imposes a traffic restriction or something like that, and then you can observe almost as if it were an experiment. But for the most part you have to understand that there are lots of confounding factors, the modeling has to account for that, and it can be a much messier picture. And do researchers in your discipline need to jump onto opportunities like that when one of those things happens, for example a city hosting the Olympics? Yeah, the Olympics is one example; it's a regular occurrence now. I think for the last five or six Olympics there's been an air pollution study in the host city. Things like that, which are planned many years in advance, are great opportunities, but other things are sometimes opportunities too. Sometimes there's environmental control regulation, but it doesn't get implemented at the same time in every location, so you can look at the differences there. Sometimes a power plant gets shut down, for example because the workers go on strike. So there are opportunities for these kinds of natural experiments, and you do have to jump on them when they occur. So you've spoken to a number of issues, such as data being very complex and messy, the signal-to-...

...noise ratio being low, and the difficulty of doing controlled experiments. I'm wondering how data science can help us solve these challenges. Well, I think there are a lot of opportunities for data scientists to work in this area. One of the nice things about environmental research is that a lot of the data is public, so the accessibility of the data is extremely high, and there are a lot of opportunities for data scientists to look at specific questions in certain areas around the world. Often the data is collected by governments, and they will typically, some places better than others, make it publicly available. The monitoring network data available in the United States, for example, is very detailed, with very interesting information about monitored pollutants across the country. So there's a lot of data that's basically just sitting there. People like me work on it, but there aren't necessarily a ton of people in the academic world looking at it, and I think there are a lot of opportunities for data scientists to take the data and answer relevant questions for whatever area they're in. How much domain expertise would a data scientist need in order to think about these types of questions? Because, sure, the data are open, but are the techniques and methodologies highly sophisticated? Or do you think data scientists working in other fields can come in and think about these sorts of questions? That really depends on what you're trying to answer. There are various levels of questions, and some require a lot of domain knowledge and some don't. For example, there's a lot of interest in citizen-science types of work, where people want to know: what's the air pollution near my house, or in my neighborhood, or in my town? It doesn't require a PhD in physics to get the monitoring data near where you live and see what the levels are, to look at trends over time, or to look at monitors in the neighboring town, things like that.
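As a sense of how low that barrier can be, here is a minimal R sketch of the citizen-science look Roger describes. The file name and column names are hypothetical; in the US, a similar CSV of daily monitor readings can be exported from the EPA's public air quality data:

```r
library(tidyverse)

# Hypothetical export of daily readings from a nearby monitor,
# with columns: date (YYYY-MM-DD), site_id, pm25 (ug/m^3)
readings <- read_csv("my_town_pm25.csv")

readings %>%
  ggplot(aes(x = date, y = pm25)) +
  geom_line(alpha = 0.4) +
  geom_smooth(method = "loess") +   # the long-run trend over the daily noise
  labs(x = NULL, y = "Daily PM2.5 (ug/m^3)",
       title = "Air quality near my house over time")
```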

I think there are a lot of different kinds of questions that can be asked, and depending on how detailed or how complex the question is, it will require certain types of knowledge. But a lot of work can be done with a basic understanding of how the data is collected, how the monitoring works and what the pollutants are, and there's a lot of information, for example on the EPA website in the US, that can be gathered there. And I presume that prediction is a huge thing you need to be doing when thinking about the effects of air pollution. Absolutely. In fact, that's one of the more modern innovations in air pollution work, because in the past we've always relied on the raw data at monitoring locations, which is nice because it's observed and it tends to be high-quality data. But we don't have monitors everywhere in the entire country; we typically have a few scattered across each city. So the development of prediction models for pollution exposure has been a huge benefit to the community, because now we can develop models that predict air pollution exposure pretty much anywhere in the country. Of course there's uncertainty associated with that, but it allows us to do much larger studies with much more comprehensive coverage across the US. And a lot of these models are interesting because they integrate lots of different kinds of data: there's monitoring data, there's satellite data, there's all kinds of land-use data. The integration of all these different data sets into these large prediction models has really been a huge benefit to the field of air pollution research. And this is something you spoke to earlier, when you said that data management can be a huge challenge. Absolutely, because you're looking at all kinds of formats. You've got GIS data, you've got monitoring data; everything is spatially at different scales, and so you have to integrate things at different spatial scales. It's a huge, complex task, and it requires quite a bit of technical skill and an understanding of the different data formats and things like that. I'm sure.
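To give a flavor of that integration step, here is an illustrative R sketch, not the models from Roger's research: all file names, column names and the shared grid key are hypothetical. The core move is aligning monitor readings, satellite retrievals and land-use variables on a common spatial unit before fitting any exposure model:

```r
library(tidyverse)

# Hypothetical inputs, each originally at its own spatial scale,
# pre-aggregated here to a shared grid cell:
monitors  <- read_csv("monitors.csv")    # site_id, grid_cell, pm25
satellite <- read_csv("satellite.csv")   # grid_cell, aod (aerosol optical depth)
land_use  <- read_csv("land_use.csv")    # grid_cell, road_density, pct_urban

# Join everything on the shared key, then fit a simple exposure model
training <- monitors %>%
  inner_join(satellite, by = "grid_cell") %>%
  inner_join(land_use,  by = "grid_cell")

exposure_model <- lm(pm25 ~ aod + road_density + pct_urban, data = training)

# Predict exposure for every grid cell, monitored or not
grid <- inner_join(satellite, land_use, by = "grid_cell")
grid$pm25_hat <- predict(exposure_model, newdata = grid)
```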

And when you talked about using these types of prediction models, without getting too technical, could you let us know what type of techniques you, your collaborators or fellow researchers use? Yeah, I think there's a bit of a range in how people approach these kinds of problems. You've got people doing neural networks to predict pollution exposure, and other people doing very complex regression models; everyone's got their favorite tool. But the one thing I've found is that the tool we use is not a major source of variation in the quality of the models, and the reason is that everyone is using the same data. Everyone uses the same monitoring data, everyone uses the same satellite data, and most people use the same land-use and land-cover data. Because the data sources are almost identical across the different models, the variation in the accuracy of the models is relatively small, even though people are using different techniques. And you also used, with respect to prediction in particular, a word which is, I think, one of the most important and undervalued words in the modern data science and statistical landscape: uncertainty. You make a prediction, and a lot of people will give point estimates and not entire distributions or confidence intervals for what they're actually predicting. So how do you navigate communicating uncertainty to stakeholders in whatever question you're thinking about? Yeah, it's critical to recognize the uncertainty in a lot of these modeling processes, and, frankly, it's still an active area of research how to integrate the uncertainty of these exposure models into the ultimate output, which is the assessment of risk from air pollution exposure. There are a variety of ways to think about that. One is a traditional confidence interval, which incorporates both the statistical uncertainty of having a finite data set and the modeling uncertainty of having to predict exposure at various places.

Incorporating those two sources of uncertainty into a single confidence interval is one approach, but there are still different ways being thought of for how to do this and what the best modeling approach is, so I would say it's still an active area of research. There are no final answers yet. Yeah, absolutely. And do some tools and techniques help you describe uncertainty in a more communicable way? For example, Bayesian inference allows people to give distributions as opposed to just confidence intervals. Yeah, I think various modeling techniques have their pluses and minuses. Bayesian techniques, as you mentioned, are one nice approach to incorporating the many, many sources of uncertainty, and then you can plot these posterior distributions. One downside of that approach, of course, is that you have to have this grand unified model, and a lot of the results depend on that model being more or less correct. So there are other issues that can be brought in when you use certain techniques, and there's no perfect solution; some may be better for some situations than others.
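As a toy illustration of the distinction Hugo is pointing at, here is a sketch with simulated data, not the methodology of any published study. It contrasts a bare point estimate, a frequentist interval, and a full distribution of plausible effect sizes; the bootstrap stands in here for the kind of distribution a Bayesian posterior would give you:

```r
set.seed(1)
x <- rnorm(200, mean = 10, sd = 3)     # simulated exposure
y <- 2 + 0.5 * x + rnorm(200, sd = 2)  # simulated outcome with a true effect of 0.5
fit <- lm(y ~ x)

coef(fit)["x"]       # a bare point estimate hides the uncertainty entirely
confint(fit)["x", ]  # a 95% confidence interval is one step better

# A whole distribution of plausible effect sizes via the bootstrap; the same
# kind of object a posterior distribution provides, and often easier to show
boot_est <- replicate(2000, {
  i <- sample(seq_along(x), replace = TRUE)
  coef(lm(y[i] ~ x[i]))[2]
})
hist(boot_est, main = "Plausible values of the exposure effect")
```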

Let's now get into a segment called Stack Overflow Diaries, with Kara Woo, member of the R stats community and contributor to ggplot2. What's up, Hugo? Do you know the Python library seaborn? Know it? I love it, and I use it daily. So, for those of you who don't know, seaborn is a library for visualizing data in Python. Conventionally, when importing the seaborn library, people will write import seaborn as sns. This imports the functionality of seaborn so that you can use it, and aliases it to sns so that you don't have to type out the whole word seaborn every time you use a function from the library. That's right. Aliasing in Python is both common and encouraged by many; importing pandas as pd and numpy as np are other typical examples.

Yeah, this isn't really done in R, but it's super common in Python. Anyway, have you ever wondered why people use sns as the alias for seaborn? It seems like something like sbn would be more natural. Yeah, I was always curious as to why it was sns. Well, Stack Overflow user Lucas posed this question, which I heard about from Twitter user Command Line Tips. It turns out that SNS are the initials of the character Sam Seaborn, middle name Norman, from the TV show The West Wing. So sns is an inside joke that now pervades practically everyone's use of seaborn. Ha! I have to admit I've never seen The West Wing, but do you think Sam Seaborn would mind? I don't know what Sam Seaborn would have to say about this, but he does crack a statistics joke in at least one episode, so I'm guessing he'd be pleased. So Michael Waskom, the creator of seaborn, must be a die-hard West Wing fan, right, to include a joke like this? Yeah, I think so. He's written several other Python libraries named after West Wing characters, like ziegler and lyman. Lyman is a package for analyzing neuroimaging data in Python, and ziegler is a web app for reporting results from lyman. Well, there you have it, folks. Thanks once again, Kara, for reading us a page from your Stack Overflow diaries. Always a pleasure, Hugo.

After that interlude, it's time to jump back into our chat with Roger. I want to pivot slightly in this conversation. As we discussed, these types of questions are really important to policy makers and legislators, and there are lots of results coming from all directions in the research landscape. I know that a topic dear to your heart is reproducibility in science. We hear a lot that there's a reproducibility crisis in science, and I'm wondering if this is something you see in your field of research. Well, I think one of the benefits, and downsides, of being in the air pollution area is that it...

...is, in terms of the impact of the research, a reasonably high-stakes area, in the sense that lots of people care about the topic and lots of people care about the findings you come out with, and so the work tends to get quite a bit of scrutiny. That, I think, is ultimately a good thing. First of all, obviously, people care, which is good, and also, because there's a lot of scrutiny on the work, it forces you to be rigorous in your thinking and to be transparent in the process. So the idea of reproducibility within air pollution work, and environmental work more generally, is not unfamiliar, in the sense that the work has had a lot of scrutiny from the get-go. The other issue is that, because it's largely an observational science and we don't do a lot of controlled experiments, there's a mentality in the epidemiology area that we basically don't trust any result unless it's been replicated a number of times, just because any observational study is going to have some sort of limiting features. You have to replicate things over and over again before you really want to stake any claim on them. So I think the natural skepticism of people in the observational sciences is useful, because it prevents us from jumping on the latest hot finding and saying this is the latest thing, this is the smoking gun or whatever. There's a certain amount of patience built into the field that requires interesting findings to be replicated at least a few times before we accept them as fact. Yeah, and it sounds like a really important part of this is establishing a conversation: not just publishing journal articles, but actually talking with everyone as part of a community. Yeah, absolutely, and involving more than just the academic researchers. There's constant discussion with policy makers and researchers and the various different stakeholders in the area at large. This brings me to another...

...point, which is that you love to talk about data science. This was obvious in 2013 when I took your MOOC, and it's obvious when listening to your podcast with Hilary; you're both wonderfully contagious, in a positive way. What is the role of conversation and communication in data science at large, to your mind? Well, first of all, it might just be that I love to talk, I don't know. But I think communication, and in particular verbal communication, is very important in data science. Data analysis, in fact, I have found to be a highly verbal endeavor, and when you talk about what you've done and how you've done it, even just to one person, it doesn't have to be some group, it engages a different kind of analytic process in your brain that lets you think about what you've done and criticize it. When you don't have those kinds of opportunities to communicate what you've done, you can still do good work, but it's a very different process. So, for example, at Johns Hopkins, when we teach students, it's very important that we give them venues to present things, to communicate what they've done, to describe and defend it, because the act of talking, and the act of presenting more generally, is a different kind of analytic process and allows you to understand: am I doing something that makes sense? Does it make sense to other people? So I think it's critical, just for your own thought process. The other thing that's important with communication is being able to understand the person or the people you're communicating to. Part of a successful data analysis or data science project is understanding what the audience already knows, what they don't know, and presenting to them something that they will...

...find interesting and useful. Absolutely, and an understanding that you need to develop or find common ground and a common language, right? Right, and that's very dependent on the audience. I think there's a tendency to think that an analysis can stand alone, that it can be either good or bad by itself, but a lot of the quality of an analysis depends on the analyst's understanding of the audience and on presenting it in an appropriate way. So what does the future of data science and statistics look like to you? Do you want that in one sentence or two? I'll give you three. Okay. It's like a dissertation. Exactly, we've got a committee. Well, it depends on who you are, but I think it's very bright if you're entering this field. The incorporation of data into decision-making in all areas of work, whether it's business, government, academia, whatever, is still increasing, and I think people are still coming to understand its value. A lot of the low-hanging-fruit-type stuff is gone, or is going: the Moneyball scenarios, where there are obvious inefficiencies that can be taken advantage of by using data, are quickly going away. But nevertheless, the importance of using data in decision-making, and more importantly of collecting data, is still increasing, and I think over time we're going to see a greater emphasis on how we collect data, on the quality of that data, and on how the way we ask a certain type of question can affect the data. The early stages of this trend have been focused on the idea that there's all this data out there and we should just, quote unquote, use it. But as we go forward there will be greater emphasis on thinking about how we collect the data, and on spending more time asking the right question. Yeah,...

...and how we document the data, how we think about data lineage and all of these things, and data ethics as well. Yeah, I think there's a very large conversation that needs to be had, going forward, about what the limits of what we can do with data are, and about what ethical guidelines we need to agree on as a community, as a society. That conversation largely has not occurred, but it will have to at some point. And so you mentioned something I find very intriguing, the incorporation of data into decision-making, and we see this happening at more and more levels in society and in businesses. Part of this, going forward, will involve people understanding data better, or speaking the language of data. I suppose I'm really speaking to a form of data literacy or data fluency there, which a lot of people think is a skill or a form of knowledge reserved for the few, and I was wondering how you feel about that. Well, I don't think it's something that's for the few. There are a variety of levels at which we can think about data literacy, and not everyone needs to be able to fit a machine learning model. There could be a greater emphasis on training, particularly at the earlier stages of education, on thinking about evidence and the interpretation of evidence. This can be done at a very simple level or at a very complex level, and I think it should be done at all those levels. There's going to be an increasing need to understand what qualifies as evidence, how data is being used, and how to separate the interpretation that drives decision-making from what the evidence in the data actually is. We don't necessarily get a lot of education or training in that at the earlier stages at the moment, but I think it will eventually filter down and become a core skill, like reading and writing and those kinds of things.

I think so. And presumably one place it could occupy is in the math curriculum at high school; it could be in a lot of other curricula as well. But I honestly feel there are very strong arguments for replacing certain parts of math curricula, such as calculus. I mean, the number of integrals that I had to perform at an early age could be replaced with learning about data and the basics of probability and that type of stuff. Why would you want to get rid of integrals? Didn't you enjoy them? I loved them. But the amount that I had to do... I mean, if I had become a physicist or an engineer, it would have been incredibly important, but I think the strength of the emphasis on calculus at high school is perhaps unnecessary these days. I think the advantage of building in a data science type of education at an early stage is that it gives people an opportunity to do something, to create something, early on. It's interesting in a very different way than mathematics is. Mathematics is interesting almost like an art, in terms of its beauty, but data science is interesting in a way that allows people to create things, to build things, and that can be very attractive. And if we were to teach people data science or data analysis, or thinking in terms of data, data literacy, data fluency, today, we would teach them programming skills as well, right? Yeah, I think programming skills are important to the act of data analysis. It's not a hundred percent clear to me at what point it's essential to introduce them, because one of the things that I'm interested in now is that I don't think we have a perfect system for teaching the art of data analysis yet, because I don't think we...

...have a great formal framework for teaching people data analysis, for understanding what makes for a successful data analysis and what makes for failures. A lot of what we do in terms of training now is essentially: here, just watch me while I analyze the data, and learn from the things that I do. That's how we teach PhD students, for example, and it's kind of a fuzzy thing. We don't have a great formal framework for saying here are the components of a data analysis, and what to do, and what not to do. I think that requires a bit more understanding, and it's something that's independent of, say, programming languages and tools. So what does make for a good data analysis, in your mind? Quite honestly, that's a question that's difficult to answer, and I think one of the reasons is that I don't think it's something inherent to a data analysis. The key problem, of course, is that there are always people involved in data analysis. People are always the problem, right? There are people who ask the questions, there are people who receive the answers, and the success of an analysis depends on the people involved. You talked about the future of data science; I think one of the issues that will come up in the future is that there's a lot of interest in automating and mechanizing data analysis, and there's a thought that there's a lot of money to be made in doing so. But fundamentally, data analysis is a human enterprise. People come up with the questions and people act on the answers, and so it's difficult to say universally that a given data analysis is successful or not, because it depends on all these human elements. So then how would you go about teaching good data analysis? Well, I think you have to think about what basic principles people rely on to say, okay,...

...for example, if someone says this person is a great data analyst, what do they mean by that? What are the qualities of that person's behavior that make them great versus, say, another person? I think a big component, and there are certainly others, is trust: if someone does a data analysis and you trust that the analysis is well done, then that person is a good data analyst. There's a sense in which you trust that person, you trust the work that was done, and you trust that there aren't hidden elements in the data that you can't see. Anything an analyst does to get you to trust the results, I think, makes for a good data analysis, and so building that kind of trust is really important. And I think communication, and using the right kind of language, is one way to build this type of trust, right? Absolutely. If people are receiving the results of an analysis and they're communicated in a way that they understand, using language that they understand, they will accept it more willingly than if you just throw a bunch of jargon at them. So we spoke briefly about how, when actually doing data analysis or teaching practical data analysis, the modern way to do it is using programming languages. I suppose this is somewhat of a loaded question, but what programming language would you choose to teach? Well, there are so many choices out there, but I think it depends a lot on what you're trying to do. If you're trying to do data analysis, I still think that R is a great language for that; it's designed with that in mind. But there are other great languages for other purposes. In particular, if you're trying to integrate with lots of other tools, languages like Python might be better for that. It depends on the environment, it depends on what you're trying to do, and I don't think it's an either-or type of question. But of course, people are limited; I'm limited, I know, in terms of what I can do at any given time, and I can't focus on five different languages at once. But I do think that R is a great language for data analysis, and it has only gotten greater over time, in terms of the community, in terms of the packages available and the capabilities. Absolutely, the community is incredible, and also the ability to write documents with it, to write data analysis narratives. Absolutely. That has changed dramatically since I've been using R, and obviously for the better. The ability to do reproducible work, to write documents where there's code and there's data and there's results, is just a fantastic feature of R.
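The episode doesn't name a specific tool here, but R Markdown is the kind of code-plus-data-plus-results document Roger is describing. A minimal sketch, using a data set built into R:

````markdown
---
title: "A reproducible analysis narrative"
output: html_document
---

The `airquality` data set ships with R; ozone and temperature are related:

```{r ozone-vs-temp}
summary(airquality$Ozone)
plot(airquality$Temp, airquality$Ozone)  # the figure renders next to the code
```
````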

Let's now jump into a segment called Rich, Famous and Popular, with Greg Wilson, who wrangles instructor training at DataCamp. Hi, Greg. What do you have for us today? Well, to be honest, what I have today is a little bit of heresy. I believe that ten years from now most people who are doing data science will be doing it in JavaScript, or in some strongly typed derivative like TypeScript. Are you serious? Yep. I don't think R and Python will go away, or that the people using them now will wake up one day and switch languages. What I think is going to happen instead is that as people who are already programming start doing more statistics and data analytics, they're going to choose to do it in the language they already know. And these days, no matter what else programmers use, they eventually have to learn JavaScript. Once you step outside scientific computing and look at the other ninety-nine percent of programming, it's everyone's other language, or even their first language. I simply can't see a future in which those programmers choose a single-purpose, immature language like Julia, which doesn't have all the non-numerical libraries they need, over one they already have to master to build websites or mobile applications or even server-side applications, and that already has a gazillion libraries. And with major players like Microsoft, Google and Facebook all working hard to make general-purpose JavaScript faster, it's going to be more and more difficult for niche players to keep up. But JavaScript doesn't really have any numerical libraries, does it?...

No, but that would actually now be relatively straightforward to fix. Node.js is easily the most popular desktop ecosystem for JavaScript development, and it has a pretty clean story for integrating external libraries written in other languages. I think it would only take a few weeks to take the core of NumPy and make it callable from JavaScript. You'd then have to build the equivalent of SciPy, and you'd probably want a few syntax extensions for array indexing, but we're going to have to do all of that anyway for any new language. Plus, whatever is developed now for the desktop will be able to run within a year in the browser, using WebAssembly. So do you really think this is going to happen? Hell if I know. I've been wrong much more often than I've been right. But when I first got involved in high-performance computing back in the nineties, the landscape was completely dominated by special-purpose hardware, and if you'd asked me, I would have said that would always be the case. We simply didn't realize that for every smart person at Cray or Thinking Machines, there were a hundred or a thousand smart people at Intel focused on making general-purpose hardware blazingly fast. I think the same is now true of languages. Thank you very much, Greg. If anyone in the audience is interested in telling Greg why he's wrong, please get in touch; we'd love to hear from you. Thanks, Greg, and looking forward to speaking with you again. Thanks, Hugo. Time to get straight back into our chat with Roger Peng. Some people criticize R for not being a real programming language, and I want you to tell me why they're wrong. So that's an interesting comment, actually, and something I've thought about a little bit recently. If you look at the history of R's development, a lot of the early history of R was driven by this very sentiment, that R is not a real programming language. And so a lot of the development of R in, say, the early two-thousands was oriented towards how we could make R more like a real programming language,...

...more like, I don't know, C or C++ or something like that. What's happened now, I think, and it's interesting to see, is that that idea has been basically abandoned. Why bother trying to make R like a, quote unquote, real programming language? Why don't we make it a language that's great for data analysis? That is a very different mindset, and I think that's the kind of thinking that leads to things like the tidyverse and to things like non-standard evaluation: how do we make this a great language for people who want to analyze data really efficiently? That has oriented the R language in a very different direction, and rather than being focused on making it like a, quote unquote, real programming language, we can just have a different set of goals. So my answer to your question is basically to avoid it and say: who cares whether it's a real programming language or not? What matters now is: is it a good language for doing data analysis? I actually love that, because it not only doesn't answer the question, it tells me that I'm asking the wrong question, or that whoever makes these statements is making statements that are actually inconsequential to the task. R is there to do data analysis, and your answer changes the conversation that way. Yeah, and I'm not just trying to dodge the question; I do think that was a much more relevant question maybe ten years ago. Yeah, and your answer is a much more interesting answer today, I think. I'm not being facetious at all; it's just fascinating to see how things have evolved over time. As we've discussed, you're known for a wide array of things. I'd almost say you're a jack of all trades and a master of many: from working in research as a professor of biostatistics, to your work on MOOCs, to being a podcaster and a blogger. And I suppose I want to know your approach for just getting all...

...this stuff done. And I'm actually reminded of a story that I think your colleague and collaborator Jeff Leek told, and I'm going to get it horribly wrong, but I'll tell it anyway. One person walks up to a fence and wants to get to the other side, but they can't climb over it, so they just walk away. Another person walks up to the same fence and thinks, okay, I need a ladder. They get a ladder, it's too short, so they walk away. And then a third person, which could be yourself or Jeff, walks up, realizes they want to climb over the fence, and so they throw their hat over it, and then they say, okay, now I actually have to figure out how to get to the other side. Right, it's a forcing function. Exactly. So, after that long-winded narrative, how does this relate to your philosophy of just getting stuff done? Well, there's a way to say this that makes it sound negative, which is that, especially in academia, there's a lot of what you might call gold plating that happens: we've got to get things perfect, everything's got to be absolutely right. In science that can be a good thing; you don't want to produce shoddy research. But there is something to be said for getting something out there so that people can see it, use it, play with it, give you feedback on it, and then you can iterate again. One of the things that I'm grateful for in working with my colleagues, Jeff Leek and Brian Caffo and many others, is that they err on the side of: let's put something out there, let's let people see what we've got and use what we have, and see how they react. Sometimes people don't care, and that's fine, and sometimes they really do care, and then we can iterate again. Getting things out the door is a philosophy that we've taken on aggressively, I would say, without dwelling on whether everything is perfect, everything just right, and I think that has its plusses...

...and minuses. Obviously it's not always the best approach, but it allows us to play with a lot of different things, and my personality likes doing lots of different things, seeing what works, and learning new things as we go. I enjoyed learning about video editing when making these MOOCs, and learning about audio with the podcasting, and being able to learn lots of different things gets me excited and keeps me motivated. Putting things out there and seeing people's reactions to them is really important; it's a very real learning experience, and that helps move things along. I'd love to hear more about the reactions to the MOOCs, because by this point you must have so many data points. How many people have been through them? The number of people enrolled in one of our courses is upwards of maybe five or six million at this point. That's incredible. And thousands of people have completed the whole sequence. In terms of reactions, if you can think of a possible reaction, we've gotten it, and that's some of the beauty of the MOOCs. We have such a huge spectrum of people taking these courses, and the reactions are all over the map. It's great for some people; for some people it's not great. We've gotten all kinds of comments about our voices and our videos and the quality of the audio and things like that. Some people like Jeff's voice and some people like my voice. It's sometimes humorous to see the feedback. But one of the challenging things we learned early on is not to react to every single comment, because if you react to every single thing you'll just be hit back and forth like a hockey puck. You have to sit back, wait, and digest everything that comes in, and see, okay, what are the major areas of focus,...

...what are the major issues that we can address, and distill that a little bit, because with so many people taking the courses, the feedback is just all over the place. Absolutely. We see something similar at DataCamp, though I've had to become slightly more robust with respect to my sensitivities about what people say about me. My all-time favorite, because I have a big beard, is someone who said Hugo turns data science into Game of Thrones. Are there any lessons you've learned about actual teaching practices, or about learning practices, about the types of mistakes people make when learning R or learning data analysis, or anything along those lines? Yeah, I think we learned a lot about teaching R, and, frankly, about how annoying it can be to teach R, and a lot of the recent developments in R have made that better. One thing I personally learned is the importance of getting people to do something that's not necessarily world-changing, but very concrete and very satisfying, early on in the process, whether that's making a plot, or building a package, or building some app or something like that. Getting people to the point of: I can do a little thing, and therefore I can do the big thing. If you just go for the big thing right off the bat, it can be frustrating for people. So that's one thing I learned: staging and incrementalism in the MOOCs. The other thing I learned is that it's very different from teaching in person, in class, where you're doing a live lecture and that kind of thing. I don't think one is better than the other; it's just such a different skill. One of the things I often say is that teaching in class is like theater and teaching in a MOOC is like television, and not just in the literal sense. The MOOC is very small-scale; you can do things at a much finer grain, whereas teaching in lectures is like...

...everything is a little bit bigger and longer, you know, it's an hour and a half, things like that. So the skills for teaching a class in one setting versus the other are something we had to learn very quickly. Yeah, and you mentioned that R can be difficult to teach and difficult for people to learn, and you alluded to the fact that this is changing in a number of respects. Is part of this due to the tidyverse? It absolutely is. And, generally speaking, I think there's a recognition that not everyone coming into R is a programmer, and so the idea that we're just going to use programming concepts to teach R is not sufficient, because not everyone coming in is going to have that kind of baseline background skill. I think the tidyverse has addressed that concern by making R, as I said before, more of a data analysis language. When we started the MOOCs on Coursera, around 2012 and 2013, a lot of the tidyverse had not yet been written, and so we didn't incorporate much of it in the courses. It's funny to see now; people ask why we weren't talking about this thing that hadn't been developed yet. We've tried to incorporate it in later iterations, but at the time it wasn't really available. So it was, in fact, harder to teach earlier on than it is now; if we were doing everything now, we would do it in a different way. And that's something we're experimenting with at DataCamp. Dave Robinson has recently launched an Introduction to the Tidyverse course which doesn't presume any programming knowledge, but gets you to import a data set straight away, the gapminder data set, and then start getting out basic results, and using ggplot2 to get out some plots almost immediately, instead of going through for loops and printing things to the console, or whatever it may be.
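As a rough illustration of that first-session experience (a sketch in the spirit of the course, not its actual exercises), the gapminder data ships as an R package, so a complete beginner can get from nothing to a real plot in a few lines:

```r
library(gapminder)   # the gapminder data set, packaged as a ready-made data frame
library(tidyverse)

gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()    # income is easier to read on a log scale
```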

Yeah, and frankly, before the tidyverse, you just couldn't do that. If you wanted to create that exact same output, you would have to know about subsetting, you would have to know about looping, just in order to get to that point. The tidyverse tools have really minimized that kind of overhead to get you to these early wins. Yeah, so I think it's great, and I think DataCamp is a great vehicle for that, because it also obviates the need for a complex setup process, like installing software and that kind of stuff. It's great for getting people to that first win quickly. So what's one of your favorite data science techniques or methodologies? I know that's probably like asking which is your favorite child, so we'll avoid that question. What's something you really enjoy doing? Well, I do have a favorite child, but I only have one kid, right? So, frankly, my favorite tool is just a simple scatter plot. Plotting is so revealing, and it's not something that, frankly, I see done a lot. I've thought about why this is the case, and I think the reason is that it's one of those tools that really instills trust in the people who receive the plot, because they feel like they can see the data, they feel like they can understand, if you have a model overlaid, how the data go into the model. They can reason about the data, and I think it's one of those really critical things for building trust. Obviously I use machine learning tools; I've used all kinds of sophisticated tools, and I've been on both sides of the table, giving people the output and receiving the output. When you see output from models and things like that, it can be very useful, but often there's a sense of: wow, there could be something missing here and I just don't see it. Plotting, whether it's a scatter plot or whatever it happens to be, gives you more of a sense that you're seeing how the data feed into the answer, and I think it's just a critically important tool. It can be misleading, no doubt, but it gives you that sense. And how would you do a scatter plot in R? So I think these days it would probably be qplot, but I have many years of just doing plot(x, y).
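For readers following along, both of the one-liners Roger mentions look like this; the variables here are placeholders drawn from a data set built into R:

```r
library(ggplot2)

x <- mtcars$wt   # example data from R's built-in mtcars data set
y <- mtcars$mpg

plot(x, y)       # base R graphics: "many years of just doing plot(x, y)"
qplot(x, y)      # ggplot2's quick-plot equivalent of the same scatter plot
```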

So I think that's a good place to leave it, because I definitely don't want to get into preferences between plot and qplot at the moment; I won't open that can of worms. So, Roger, we've come to the end of our chat, and I was just wondering if you have a final call to action for all the aspiring and well-seasoned data scientists out there. For the well-seasoned ones, I'm not sure, but for the people who may be aspiring to get into the area: there's no better time to do it. The tooling and the information that's out there is just so abundant, and, most importantly, there's a fantastic community of people who can help you along and give you resources to get you going. It's a very exciting area; I'm very excited to be a small part of it, and I'm just excited to see where it goes. I couldn't agree more, Roger. It's been an absolute pleasure having you on the show. Thanks for having me, it's been fun.

Thanks for joining our conversation with Roger about data science and the environment. We saw several challenges facing the field of environmental biostatistics, such as the preponderance of complex and messy data sets, how the signal is generally weak, and the difficulty of doing controlled experiments. We also saw the essential roles that data science has to play in this field, such as linking complex data sets together, building prediction models, quantifying uncertainty, and communicating evidence and results to the public. Make sure to check out our next episode, a conversation with Adam Kelleher, principal data scientist at BuzzFeed and adjunct professor at Columbia University. I'll be chatting with Adam about his work at BuzzFeed and, more generally, the impact of data science on the modern digital media landscape. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and DataCamp at...

...@DataCamp. You can find all our episodes and show notes at datacamp.com/community/podcast.
