DataFramed

Episode · 5 years ago

#6 Citizen Data Science

ABOUT THIS EPISODE

David Robinson, a data scientist at Stack Overflow, joins Hugo to speak about the evolving importance of citizen data science and a future in which data literacy is considered a necessary skill to navigate the world, similar to literacy today. We'll speak about many of Dave's projects, including his analysis of Trump's tweets that demonstrated the stark contrast between Trump's own tweets and those of his PR machine. We'll also speak about ways for journalists, software engineers, scientists and all walks of life to get up and running doing data science and analysis.

In this episode of DataFramed, a DataCamp podcast, I'll be speaking with Dave Robinson, a data scientist at Stack Overflow. We'll be speaking about the evolving importance of citizen data science and a future in which data literacy is considered a necessary skill to navigate the world, similar to literacy today. We'll speak about many of Dave's projects, including his analysis of Trump's tweets that demonstrated the stark contrast between Trump's own tweets and those of his PR machine. We'll also speak about ways for journalists, software engineers, scientists and all walks of life to get up and running doing data science and analysis. I'm Hugo Bowne-Anderson, a data scientist at DataCamp, and this is DataFramed. Welcome to DataFramed, a weekly DataCamp podcast exploring what data science looks like on the ground for working data scientists and what problems data science can solve. I'm your host, Hugo Bowne-Anderson. You can follow me on Twitter at @hugobowne and you can also follow DataCamp at @DataCamp. You can find all of our episodes and show notes at datacamp.com/community/podcast. Dave, welcome to DataFramed. Hey, Hugo, thanks so much for having me. Such a pleasure to have you on the show. I'm so excited to have you here to discuss data science, what it looks like on the ground as a working data scientist, what it's capable of and how everybody can be a data scientist, this modern concept of a citizen data scientist. But before we get there, I want to talk about you. What are you known for in the data science community? Well, so I do a few things within the data science community. I blog about data science and also programming and education at my blog, Variance Explained. So there I usually just find interesting data sets and analyze them. 
So a lot of them are about baseball, some are about programming, some are about biology, and I also try to teach statistics and programming concepts, like how to analyze particular kinds of data and how to make your R code run really quickly. And...

I share my general opinions about data science education. I've also written a book, Text Mining with R, that was published by O'Reilly this summer with my colleague Julia Silge, and that's a book about analyzing text data using a particular set of tools that we call tidy tools, particularly the tidytext package that Julia and I developed. And besides tidytext, I also develop a couple of open source R packages, like broom, for taking statistical models and turning them into tidy data frames, gganimate, for creating animated graphs with ggplot2, and fuzzyjoin, which combines data frames based on inexact matching of their columns. So I contribute these R packages to GitHub and CRAN so that other people can use them. And I've created three DataCamp courses. My latest one was Introduction to the Tidyverse, which came out in November 2017. It's an introduction to a set of tools that we call the tidyverse, dplyr and ggplot2, that allow for data transformation and visualization in a way that fits together really intuitively, and the course is designed for people to take even if it's their first introduction to R. We're going to talk a little later in our podcast about some of the philosophy I have for how to introduce people to R and programming. That is a fantastic introduction, and you've touched upon a number of things which we'll delve into deeper soon, such as the type of data sets you've written about and enjoy writing about, what it means to actually explore a data set and publish the results online, and how citizens can actually do that as well. 
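As a rough illustration of what "combining data frames based on inexact matching of their columns" means: fuzzyjoin itself is an R package, and the sketch below is only a Python analogy of the idea using the standard library's difflib, not fuzzyjoin's API. The data and the function name are made up for illustration.

```python
from difflib import get_close_matches

def fuzzy_join(left, right, cutoff=0.8):
    """Join two lists of row-dicts on approximately matching 'name' keys.

    For each left-hand row, find the closest right-hand 'name' by difflib
    string similarity and merge the two rows when the match clears `cutoff`.
    """
    right_names = [r["name"] for r in right]
    joined = []
    for row in left:
        matches = get_close_matches(row["name"], right_names, n=1, cutoff=cutoff)
        if matches:
            match = next(r for r in right if r["name"] == matches[0])
            # Merge the matched row's other columns into the left row.
            joined.append({**row, **{k: v for k, v in match.items() if k != "name"}})
    return joined

teams = [{"name": "Yankees", "city": "New York"}]
stats = [{"name": "Yankkees", "wins": 92}]   # note the typo in the key
print(fuzzy_join(teams, stats))
```

A strict equality join would drop the misspelled row entirely; an inexact join keeps it, which is exactly the situation fuzzyjoin was built for in R.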
You mentioned briefly your educational efforts, which we'll get to, your books and, of course, tidy tools and the tidyverse which, as we'll get to, is a very interesting entry point for aspiring data scientists, aspiring programmers and people who want to discover more about the world through data. The tidyverse provides a wonderful set...

...of tools for people to get started with, and that's a little teaser for some things we'll get into later. How did you get into data science? So it started when I was doing my PhD in computational biology, in the field of genomics, where we generally analyzed how we could tell whether genes were turned on or off within a cell, and it was a project where I really was working more with programming and with statistics than I was with the actual biology. So I'd work with biologists while I'd really analyze these large data sets. So in 2015, when I was finishing my PhD, I was originally planning on looking for academic jobs, and then I happened to get reached out to on Twitter by someone from Stack Overflow, the programming Q&A website that I work for now, and he'd seen a post that I'd written a couple of years earlier. A couple of years earlier I'd answered this question about the Beta distribution, particularly how can someone understand the Beta distribution intuitively, and I'd written this detailed answer. Right as I was looking for a job, he'd happened to find it, and they'd been looking for a data scientist, and someone had the idea: what if we just hired that guy? And was this the answer that involved the use of Bayesian statistics? Yeah. For me, I really liked explaining the Beta distribution through trying to estimate someone's batting average. So in baseball someone goes up to bat, let's say a hundred times, and if they get a hit thirty of those times, their batting average is .300, and estimating that about a player is one of the ways you can tell who's a better or worse player. Wait, just hold up for a second. You're telling me that statistics can be explained not just through flipping coins but through real-world applications like baseball? 
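For readers who want to see the arithmetic behind the batting-average example, the Beta update is small enough to sketch in a few lines. Dave's own answer uses R, but the idea is language-agnostic; here is a Python sketch with illustrative prior parameters (a Beta(81, 219) prior has mean 0.27, roughly a league-average batting average):

```python
def update_beta(alpha, beta, hits, at_bats):
    """Update a Beta(alpha, beta) prior on a batting average after observing
    `hits` successes in `at_bats` trials.  The Beta distribution is conjugate
    to the binomial, so the posterior is also a Beta."""
    return alpha + hits, beta + (at_bats - hits)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Illustrative prior: league batting averages cluster around .270.
prior_a, prior_b = 81, 219                    # mean 81 / (81 + 219) = 0.27

# A player goes 30 for 100.  The raw estimate is .300, but the posterior
# mean shrinks it toward the league average, because 100 at-bats is noisy.
post_a, post_b = update_beta(prior_a, prior_b, hits=30, at_bats=100)
print(beta_mean(post_a, post_b))              # (81 + 30) / 400 = 0.2775
```

That shrinkage is the empirical Bayes idea Dave builds on later: with more at-bats the data dominates the prior, and the posterior mean approaches the raw batting average.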
It's really important to find analogies that speak to you personally. It's also important to find analogies that speak to your audience, and one thing that concerns...

...me about that analogy is that it's particularly useful to Americans who follow baseball, but I think it's one way of attacking an educational problem from many sides. Some people want to read the math directly, read the equations. Some people want some really general examples: say you're flipping a coin, you're figuring out whether it's fair. And some people like examples that use real data and real problems, and baseball provides a lot of those. Someone else might have done it with clicks on a website, which is usually what I would be using these methods for in my work at Stack Overflow. Someone else could do it based on analyzing text within literature or identifying images. Yeah, there are many ways to teach statistical concepts, and I think everyone trying to teach them in different ways leaves this really healthy ecosystem. So you mentioned that you got started doing computational biology and genomics at Princeton. What I should add is that I actually took that answer and, in the long term, ended up writing many more posts where I explored how baseball could be used to explain the method of empirical Bayes, and I built it out into an entire ebook, Introduction to Empirical Bayes, which is available on my website. You mentioned that you got started in computational biology and genomics, and this sounds like science to me. Well, it clearly is. But is that data science? I'd say that in a sense I was doing data science before there was a term for it. I think a lot of current data scientists have been doing this kind of work for a long time. The idea is, I sat within a scientific department. You have some people that are really interested in biological questions. They'll be really interested in what genes affect this disease, or how can I change the genes within a tumor to treat it? 
And other people will be really interested in computer science questions. How can I make this algorithm run faster? How can I sort this list? How can I...

...build a database that can be queried well? And I think data scientists tend to straddle both of those fields: the people that are really interested in learning programming and then taking scientific data and working with it. And for me it really was a process of: I'd learned enough about computer science that I could apply these methods, that I could program and work with large data sets and wouldn't be doing things by hand, and I'd learned just enough biology that I knew what kinds of problems they would need to solve. And then I really liked it: I'd grab some data, find out some things from it, move on to the next data set, and I'd work with collaborators to do that. And I think a lot of people in my field have that same kind of impulse, and as a result, it's a very good position for us to be in, where we can then move on to, say, work at a web company, and instead of analyzing biological data I analyze data about visitors, or I could go to a data set of tweets and analyze it and make some graphs and find out things about it. Once you realize that your favorite thing to do is grab data, understand how you need to reshape and process it, draw conclusions and then communicate them, that's where the data science lies. It's not just in biology, it's not just in computer science. And so it's taking any data, really, and figuring out how to extract information from it, which of course involves a lot of reshaping, filtering, munging and data cleaning, and then being able to communicate that information. Exactly. Those are the things that are in common across any data analysis, no matter what field you're working in. There's data transformation, say, how can I clean this and how can I reshape it? There's statistical inference, for how can I separate signal from noise? 
There's prediction, for I've got some inputs and I want to classify or predict the outputs. There's visualization, for...

...how can I better understand this by making it into a graph? And there's communication. And sometimes these are sort of taken for granted within the sciences, as just something that everyone does. Data scientists are the people that are really interested in that process and interested in making it more effective, and it just so happens that computers are a great way to do this now, with the amount of data we have and the languages that allow us to reshape data, produce figures and do statistical inference. Absolutely. You know, one of the founding figures of data science was John Tukey. He was a Princeton statistician who was born, I think, in 1915, and did a lot of his best work in the middle of the century, when computing was very early, but he was really laying a lot of the groundwork. He invented some new kinds of graphs that were able to better communicate data. He largely invented the idea of exploratory data analysis: getting a data set and then figuring out what factors within it might influence each other through an iterative process of asking questions and getting answers out of it. And what I find interesting about that is that he was using these very early computers to solve these problems, and as computers have gotten more powerful, it's been a lot of people like him that have been able to move the data science field forward and keep up with it. And I absolutely do think that computers are sort of a bound on how effective data science can be. 
So you've spoken to the fact that there are so many different types of data sets that have interested you and that you've worked with, from web data in your current position, to computational biology and genomics data sets, to using the Twitter API. What data science projects have you worked on or been involved in that you think are the most impactful...

...on society or telling about society, or that you've just loved the most yourself? So one set of analyses I've done that I really love is at Stack Overflow, which is a programming question and answer website. So when people google a programming question about, say, how do I sort a list in Java, or how can I make JavaScript do this particular thing on a website, they'll often end up on one of our questions. And with twenty million hits a day, we get a lot of great data about what programming languages people are using, and we get to analyze that to see how the languages and technologies that are used change over time and are different around the world. So an example of a cool analysis we did is what kinds of programming languages tend to be visited late at night versus between nine and five. We can see that, for example, programming languages and tools built by Microsoft, like C# and Excel, tend to be visited a lot between nine and five. They tend to be ones that people use to do their day jobs. And there are other languages, like Haskell, a functional programming language, that are much more visited late at night. Those are the kinds of things that people would do as hobbies or maybe as passion projects, and I think it's really cool the kinds of insights that we can have and share with the world based on this data, and that can all be found on the Stack Overflow blog. That's really cool. Another analysis, probably the one I'm best known for, was in the summer of 2016. Some people had observed something about Donald Trump's Twitter account while he was running for president, and that was that a lot of his tweets were sort of typical candidate tweets. They were tweets that contained, say, a picture and him saying thank you, North Carolina, or I'll be on Fox News tonight. It would be these announcements. 
But then there were also a lot of tweets that were a lot angrier and kind of very passionate or...

...somewhat unhinged. And someone had had the realization, and I went through and used data analysis to confirm this, that the difference between these tweets is that some are the ones that Trump writes and some are the ones that his campaign writes, and the tell is that Trump writes all his tweets from an Android and his campaign uses an iPhone. So by looking at the device data from each tweet, you can figure out whether it was Donald Trump the candidate tweeting or his campaign tweeting for him. And the particular analysis I did showed some really big differences between the two devices, and the biggest was that Donald Trump was much angrier. You can use sentiment analysis, a tool we use a lot in my book, to tell what kinds of words the Android used that the iPhone wouldn't, and see that the Android tweets were much angrier and more likely to use words like crazy, bad and sad, while the iPhone was more likely to use positive words like win, join, thank you and fantastic. And you noticed something about the temporal distribution as well, didn't you? Yeah, exactly. Trump tended to tweet earlier in the morning and late at night himself, while the campaign was more commonly tweeting from nine to five. A few of the other really telling signs: Trump himself doesn't use hashtags or photos or links. His tweets are almost always just text, and it was the campaign that often would. That's one of the ways you could tell there was a really big difference between the two devices. And yeah, that analysis got a lot of attention. I got interviewed in a number of news articles. And you were on CNN recently? Yeah, CNN did a documentary about Trump's Twitter account and how important it's been to the last couple of years in his particular political journey, and I was one of the people interviewed for it. Cool. So I need your help now, Dave, to demystify something. 
A lot of people will have read your analysis of Trump's Twitter account, and what you've written is an incredible...

...finished product that a lot of people might read and think that you went in having a hypothesis, had a very strong idea of what you were going to find, and just did the analysis. But that isn't really how data science or data analysis works, is it? Exactly. So when I went into this analysis, I was interested in this question, but I'd never actually queried Twitter data before. So one of the first things I had to do was just look into how to do that, how to work with the Twitter R package and API, and because I had these basic programming skills, things I'd developed in other projects, I was able to use documentation and figure out how to query Twitter. So once I downloaded the data set, basically by using these publicly available tools, I was able to bring in my data set, and then, by using the same kinds of tools that I'd been working on for my book for analyzing text, and other tools for visualization, the same kinds of tools that I would use for visualizing any data I had, I was able to put this post together. So one of the things I love about the data science world is that a little knowledge of tools can be very flexible in terms of the kinds of problems you can answer. For sure. And presumably, well, we were talking about John Tukey and his ideas of exploratory data analysis. Presumably that also played a huge role in what you did. Oh yeah, I think there's a huge part of practice that comes into doing a data analysis like that. I think in total I spent about twelve hours on that blog post, including downloading the data, analyzing it, finding some conclusions and writing it up, and it's not a particularly polished blog post. Certainly something that I put on my company blog would usually be a bit more polished; I might spend a bit more time on it. But once you've done a lot of data analyses...
I've done a lot for biological data, I've done a lot within my company, and I've done some for other blog posts. It's a really fun process and it's one where...

...your habits can build on themselves. Once you learn how to use a tool, you can keep using it in the future, and it kind of speeds itself up. You start to learn to use these tools faster and faster. So that's really one of the reasons, as I'll talk about in a second, that I really recommend people that are interested in data science start a blog and write blog posts analyzing particular data sets. There's just no better practice for understanding how the whole process, from exploratory data analysis through to communication, fits together. Let's now jump into a segment called Blog Post of the Week with DataCamp curriculum lead Spencer Boucher. What's up, Spencer? Hi, Hugo. So, Spencer, you're here to talk about a blog post that you read this week and loved. Yes, I sure am. This week I really enjoyed a bunch of great reads, but one standout was Robert Chang's Medium post, Advice for New and Junior Data Scientists. We'll include a link to this in the show notes. So tell us a bit about Robert Chang. Robert first became a well-known data science blogger back while he was working as a data scientist at Twitter, and now he's on the data science team at Airbnb. Why did you enjoy the post so much? All right, so Hugo, you know how it's so easy to get caught up in the math and the coding that we forget to take a step back sometimes and think about data science as a process and as a career? In his post, Robert invites us to do just that by laying out a set of six principles that he wishes he had known early on in his data science career. And which principles did you most identify with, Spencer? So my favorite has got to be the importance of declaring your desire to learn early and shamelessly, as he puts it. Imposter syndrome is such a huge problem in data science. Feeling like you don't know as much as your colleagues do can cause us to try to pretend that we already know things that we don't, just because we don't want to look silly. 
In reality, though, you might be surprised just how far others will go to teach you...

...things if you're just sincere about what you don't know but want to learn. Were there any other standouts? Yeah, Robert also recommends paying very close attention to all of your dissatisfactions. Instead of just dwelling on the negative aspects of what you're unhappy with, let these things guide your next steps and take your data science work to the next level. And what about you, Hugo? I know that you've read this post. Did anything jump out at you? Oh look, Spencer, there's so much material in there. I think one of my favorites is to identify the tools that will help you solve your problem. Robert discusses these in terms of technologies, for example, should I use Python or SQL for my particular problem? But the point remains the same for all types of tools, such as which models to use, for example. Why throw deep learning at a problem that a simple regression can solve? Think about your problem, your question, and let that define your choices. Otherwise everything becomes a nail. Absolutely. Any others, Hugo? Yeah, I agree completely with Robert that teaching is the best way to test your understanding of a subject and improve your skills. Not only that, by explaining and communicating, writing blog posts, for example, explaining what you're doing, your work will become richer. Oh man, I want to read Robert's post again right now. Yeah, totally. Robert goes into detail about everything we've talked about and much more in his post, so be sure to check it out, listeners, and if you enjoy it, be sure to check in for the next episode of DataFramed. Our guest will be none other than Robert Chang himself, here to talk about data science at Airbnb. Thanks for telling us about your favorite blog post of the week, Spencer. Until next time, Hugo. After that interlude, it's time to jump back into our chat with Dave. 
Something you mentioned which really caught my attention was that you hadn't had much experience with the Twitter API or interacting with Twitter using a programming language before,...

...and I think that speaks to a really interesting fact: as data scientists, we're learning stuff on the fly constantly. We may bring skills, such as your knowledge of R, into play to use the Twitter R package and API, but people reading your blog might think you're a well-seasoned tweet analyzer. And I think a lot of people on the outside, as well as aspiring data scientists, may consider themselves non-technical or non-computational or even non-mathematical, and consider data science out of reach. So my question for you is, can anybody do data science? That's a really interesting question. I think there are some people that would say no. They would say it's kind of dangerous to say everyone should learn some data science, everyone should learn some statistics, and the reason some people give for saying that it's not a good message to spread is that there's a lot of danger in statistics and machine learning being misapplied. One particularly dangerous case: we're seeing a reproducibility crisis in science, where people had done scientific studies trying to detect particular psychological effects, and they're discovering that, for all these papers that have been published that looked like they were statistically backed up, other people aren't able to get the same results, and one of the major reasons for that is abuse of statistics. A famous example would be p-hacking, that is, hacking a p-value. A p-value is a way that scientists can tell whether an effect is real or just noise, and if you try looking at a bunch of different parts of your experiment, like if you run an experiment on a thousand people and you don't get an effect, but you say, oh, I'll only look at the men, or I'll only look at the women, or I'll only look at people over forty, if you slice and dice your data enough, you can make an effect kind of...

...appear out of nothing, but it won't be real, and that's a danger that not having enough statistical education can lead to. And another big problem is algorithmic bias, and this has gotten a lot of attention recently. There's a problem with people designing, say, a machine learning algorithm to decide who gets a bank loan. If they just take some data and train on it without really thinking through some of the implications, they can end up building racist or sexist systems without even intending to, ones that discriminate against people based on where they live or, say, how they act online, and these can have real consequences. They've had consequences within finance and within the tech industry, in cases where people just listen to an algorithm's recommendations without properly controlling for some confounding factors. I won't give a full exploration of it here, but there's a lot of really great work being done to understand all these dangers. So everyone agrees there's that danger, and the question is, what is the solution? How can we try and protect against it? And here's where I differ with some people. Some people, I think, would say this shows everyone needs very rigorous statistical training and you can't be allowed near numbers until you've done it. You can't be allowed to implement a machine learning algorithm until you've had all this training in it. Therefore, we should be careful about saying anyone can analyze data; it's just going to lead to data being analyzed poorly. I understand that approach, but I think it's not the right way to handle the problem. So Hadley Wickham is a really influential developer in R; he wrote the popular dplyr and ggplot2 packages, and I really like how he compared this approach to abstinence-based sex education, where they say you can just never have sex, there are these dangers behind it. Only...

have sex once you're married. So he says this approach of saying you should ban everyone from statistics until they've learned all these tools is like abstinence-based stats education: you should only do statistics if you're in a committed, long-term relationship with a professional statistician. And this has the exact same problem that abstinence-only sex education has, which is that people go and do it anyway. If you start just trying to put up these walls and say, well, I'm not going to teach you any statistics because you're just going to misapply it, or, well, if you want to do machine learning, you'd better get an entire PhD in the subject first, you're just making it completely unattainable. So that's why I talk about what I think of as the citizen data scientist: the idea of someone who doesn't work as a data scientist, doesn't have statistical training, but still learns things that are useful for them to analyze their data, learns the pitfalls and the approaches they can use to handle data safely, learns them early on and starts applying them. I think it's better really to be upfront with education and say everyone should learn some basics of programming, of statistics, of data analysis, of visualization. I think it's useful for software engineers, it's useful for journalists, it's useful for scientists across the entire spectrum, psychologists, biologists, physicists. It's useful for executives in companies that might want to understand their own data. I think that learning a little bit about how to handle data, to understand relationships, to do statistical inference, to build graphs and communicate about data, is useful for everyone, in the same way that writing is useful for everyone, not just professional writers. And knowing where to look for help as well, and where to ask for help. Exactly, and that involves building a welcoming community where,...

...if you say I want to do this, the answer isn't, well, you shouldn't be analyzing data, you're not a statistician. The answer should be: great, here are the things you need to know. It's basically building a welcoming environment, and one that has all the resources ready to say, if you need help with this, here are the things we're going to have ready for you to learn to use. So I look back at my Trump analysis and I realize nothing in there was particularly advanced statistics. Nothing in there did I really need my PhD for. That's the kind of thing that a journalist could learn if they were committed to learning some R, some basics of data visualization and some programming. They could have taken that Trump data, found the same patterns that I had and written about it. And I think a world where journalists are able to parse data that way is one where you see a lot of great journalism happen. I think the same thing is true across almost any field. I think people are better at serving their business if they're able to analyze the data. So what we're really talking about, and this kind of struck me when you mentioned the analogy with writing, is that citizen data science actually involves some sort of basic data science literacy across the community. Exactly. Data literacy would be exactly how I'd say it, and I note that that's different from saying anyone can be a data scientist. Being a data scientist professionally involves, like any field, having skill, experience and a lot of focus in that particular area. So I think there are a lot of paths to being a data scientist, but it does involve a good amount of commitment. On the other hand, almost anyone can be data literate. 
They can understand: I'll download some data, I'll build some graphs out of it, run a few regressions to understand how variable X affects variable Y, and understand some of the pitfalls that underlie it. And even if they aren't doing the...

...programming themselves, they might be able to talk to someone else who is doing it and understand the pitfalls and the dangers of it. So I think data literacy is not an either-you-are-or-you-aren't-ready-to-work-with-data thing. It's really a continuum, and I think the deeper everyone gets into it, the more these skills spread throughout the field, the healthier the entire community can be. But we've touched upon the fact that when doing data analysis or having some sort of data literacy in the modern age, programming ability is some sort of prerequisite, maybe not extreme programming ability, but I think for a lot of people, programming ability is a potential barrier to entry for data science. What technologies do you see emerging that are reducing this barrier? So I think this is one of the biggest challenges facing not just the modern data science community but the entire world: how can we spread programming ability most effectively? And I think there are two fronts to this, two ways that we can approach it. One is making programming more intuitive, and the other is making education more effective and available. So in terms of making programming more intuitive, I really like a quote from Hal Abelson, a computer scientist, who said programs must be written for people to read, and only incidentally for machines to execute. The idea is that when programs were first being developed, it was all about how can I get the computer to solve this problem, but fifty years later we're in a very different world. We're somewhere where we really should be thinking of the problem as how can we get programmers to understand this program, how can we design our tools in a way that is easy for people to use? I think a lot of people are doing really great work here, and one of them is Hadley Wickham, who I mentioned before. 
He created the dplyr and ggplot2 packages that are at the center of this set of tools called the tidyverse, and as they like to say, these tools are developed for humans. There's a lot of thought put into how you can make them fit together intuitively, in ways that are easy for beginners to learn. A big thing there is consistency. An example would be when you have many functions, many tools, that all work with strings. A few ways you can make them consistent, and therefore easy to remember: you can name them in similar ways. He starts them all with str underscore, so these are all tools that work with text strings, and they're named so they work together. Another way is to make sure that they all take their arguments, that is, their inputs, in the same order. So, say that you always have the string that's being operated on be the first argument, the first input to the function. There's a lot of detail to that, and he lays it out in a great document called the Tidy Tools Manifesto. But the really important thing is the consideration, when one is building tools, to think: how can it be beginner friendly? How can it be something that people who are new to programming can learn without making lots of avoidable errors, along the lines of "oh, I forgot that this function had this name, but that one had that name," or "oh, I forgot that this requires this format, but that should be in another format." He thinks a lot about making these tools intuitive. I think the naming of the tools is incredibly important in terms of having a structure in which we can actually implement the techniques we want to in a consistent way. I also think something that Hadley, and the tidyverse in general, has really locked in on is that we all have a collective way of thinking about data already in our heads, before computation. So we think about, you know, GDPs of countries, or birth rates, or whatever it may be, and we can talk about them.
We can talk about which country had the highest birth rate in a particular year, these types of things, and the actual computational structures that have been built into dplyr, for example, and ggplot2, mimic the patterns in which we actually think cognitively. So you write code that mimics the way we think about the data set and talk about it. Right, exactly. I think a lot of the center of that, and it's happening not just within the tidyverse and not just within R, is based around the data frame, the data frame's particular structure. I think it's revolutionizing a lot of the ways people build data science tools. It's a way of structuring data to think of it in rows and columns. Hadley likes to define these in terms of: each row gets one observation, each column gets one variable. And it's about building tools that take that as their natural input, and thinking about transforming your data in this rectangular format with rows and columns. It's amazing to look back just a few decades at how much data analysis and how much programming was done using objects that were shaped very differently. People thought in terms of lists, and they thought in terms of dictionaries that map a key to a value. But the idea of the data frame as the natural unit of analyzing data is just very powerful. Another place that's happening is in pandas: pandas in Python has made a lot of these tools really easy for humans to work with by structuring it all around data frames. So, Dave, as educators, what can we do to lower the barrier to entry for data science and programming? Yes, so I think educators have a really important challenge going on right now, to make programming and data science education more available and more effective, and I think for that there are two really important considerations. One is just diversity of learning styles. Everyone learns in different ways, so we need to teach a lot of different ways.
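The data-frame idea described above, one row per observation and one column per variable, can be sketched in pandas, which the transcript mentions just before this point; the country names and numbers here are invented purely for illustration:

```python
import pandas as pd

# Tidy layout: each row is one observation (a country in a year),
# each column is one variable.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2010, 2000, 2010],
    "birth_rate": [25.0, 20.0, 12.0, 10.0],
})

# "Which country had the highest birth rate in 2010?" maps directly
# onto row filtering and column operations.
latest = df[df["year"] == 2010]
highest = latest.loc[latest["birth_rate"].idxmax(), "country"]
print(highest)  # country "A"
```

The question we would ask in conversation translates almost word for word into operations on rows and columns, which is the point being made about the data frame as the natural unit of analysis.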
Some people really like learning from books, so Hadley Wickham has a fantastic book, R for Data Science, that serves as a great introduction, and with our own book, Text Mining with R, the thing that we've done is make it available for free online as well as being sold, and that's one good way to make educational resources really widespread. I also think DataCamp is doing really terrific work for people who learn well from videos and from interactive exercises, through learning by doing. Both of these handle the other problem, which is scale: learning in colleges is just not enough anymore. There aren't enough colleges in the world to teach the amount of data science that the world really needs, and it's not realistic given how fast the field changes. People are going to be entering the field later in their lives, after their formal education, and people are going to need to keep learning. So for tools such as massive open online courses like DataCamp, and resources like books and documentation, there's this real mass effort to improve them and make them more and more accessible to people who might be new to programming. I think the R community is doing a really fantastic job there. Another place that's been really productive is RStudio's community.rstudio.com website, basically a discussion forum that is a really welcoming place for beginners to go and get feedback and build a network for how they can learn. So those are, I think, some of the ways that education can be made more available. But I think it's also really important to consider the order in which concepts are introduced, and the way people are introduced to them, so that it can be well oriented towards citizen data scientists, towards people who aren't looking for an entire PhD or an entire career in data science but want to learn enough to do their jobs better. And for me that comes down to the really important educational concept of: teach powerful concepts quickly, teach people the tools that they can immediately apply. So my introduction to the tidyverse is based entirely around this. I don't start by teaching programming concepts. I don't start by teaching variables and loops and functions and lists, a lot of the programming concepts that people might typically start with. I start with the data. I say: here's a data set, we're going to be analyzing it, let's ask some questions, let's get some answers. And by the second chapter they're already building graphs out of this data and starting to really gain insights from it. So I think that's a really important approach. When you have someone like, say, a journalist, someone who's busy but wants to use some R in their work, they're not looking to go through a long educational process, they're looking to get something done right away, and that's how they can learn something, use it, and build on it to keep doing better and better work. That's awesome, because speaking to that, I think, is this idea of learning in order to be able to do something straight away, which is analogous to what we were discussing earlier: learning statistics not necessarily to find out about the binomial distribution, but to find out something about what's happening in baseball statistics, or genomics, or gerrymandering, or voting systems. Right, exactly. The world is filled with problems and questions that people want solutions and answers for, and the challenge involves them being able to get their hands dirty with the data. So if you give people a programming lesson and you say it will be useful someday, they're not going to stick with it.
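As a sketch of this "start with the data" approach, a first lesson might jump straight to answering a question rather than teaching loops and variables. The course discussed here uses R and the gapminder data; this is only a Python analogy, with life-expectancy numbers invented for illustration:

```python
# A "first lesson" that answers a real question immediately.
# Hypothetical life-expectancy values, in years.
life_expectancy = {
    "Norway": 81.9,
    "Japan": 83.4,
    "Brazil": 74.6,
}

# Question: which country has the highest life expectancy?
best = max(life_expectancy, key=life_expectancy.get)
print(best, life_expectancy[best])  # Japan 83.4
```

The learner gets an answer to a real question on line one of their education, which is exactly the hook being described.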
But if you tell people, here's how you can start answering questions with your data right now, that's the way you can really build their data literacy. And that's actually something that's really cool in your Introduction to the Tidyverse course, because a lot of introductory data science programming courses will tell you how to print the integers one through ten, or zero through nine if we're zero-indexing, in a for loop, as opposed to showing, from the gapminder data set, how to plot how literacy has improved since 1982 across a bunch of nations. Right, exactly. This is something I feel very strongly about, with regard to an educational philosophy of starting by having a goal for what you should get students doing, and getting them doing it very quickly. I'm really looking forward to seeing whether that course is successful in accomplishing this goal, collecting some data on that, and really seeing how it can help spread understanding of R and let people work with their own data. Now, for something completely different, we're going to dive into a segment called Rich, Famous and Popular with Greg Wilson, who wrangles instructor training here at DataCamp. What's up, Greg? What are you going to tell us about today? I'd like to talk a bit about Git, and about how we can use data science to make it accessible to the ninety-nine percent of humanity who aren't using version control today. Now, some people claim that Git is intuitive, but it's actually the most complicated program I've ever tried to teach, and I've taught people Emacs. Some of that complexity comes from its inconsistent command syntax and jargon, but I think there's a deeper cause. Way back in 1968, Edsger Dijkstra argued that arbitrary use of goto statements led to programs that were hard for both human beings and compilers to understand, and that we should instead restrict ourselves to a handful of control structures like for loops, if-else statements and subroutines. That might all seem obvious today, but it wasn't at the time. It took a decade for programmers to realize that limiting themselves to a small subset of all possible flows of control when building software actually made them more productive.
We haven't had that collective epiphany yet when it comes to distributed version control systems like Git. At its core, Git is a way to construct and manipulate a graph that represents changes made to a project. It allows programmers to make almost any change to that graph they can imagine, and the results can be just as hard to understand as the tangled flowcharts of our distant ancestors. In practice, though, most of us only change the graph in a small number of ways. That got me wondering whether version control will have its own structured revolution, and I therefore have a proposal. Step one: download data from several thousand large projects on GitHub. Step two: use data science methods to find patterns in those projects' graphs. Step three: select a small set of common subgraphs that will cover most everyday uses. Step four: build a tool that provides those, and only those, to users, so that they've got a more structured environment, a more predictable way of working. And then step five is profit. Now, step three is a bit speculative. I have no evidence that usage patterns will actually fit a long-tail distribution, but I think most of us would be surprised if that wasn't true. And step four is the one that will lead to all the shouting, as happened when structured languages eliminated goto statements. A minority of very vocal programmers will point at fringe cases that can't be handled by your chosen set of simple constructs. However, everyone else, the people that Microsoft's Scott Hanselman calls dark matter developers, will thank you. If anyone in the audience is interested in giving this a try, please get in touch; we'd love to hear from you. Thanks, Greg, and looking forward to speaking with you again. Thank you. Time to get straight back into our chat with Dave Robinson. So, we've discussed a lot about what we as educators can do to help people get started with data science. How can people out there get started themselves? What kind of action can aspiring data science listeners take? So I think the two steps are: find some data that you're interested in and a question you have about it, and then learn how to work with it using programming and statistical tools. On finding data, I think there are a lot of really terrific data resources, and there's this great community of open data, where companies and communities and academic institutions have been sharing a lot of their data that people can analyze. Kaggle has many data sets, and so do some R packages. My DataCamp course is built around country statistics in the gapminder data set. And it's worth contributing to that if you can: if you have interesting data, it's probably worth sharing it. So once you have data, whatever you want to work with, then how can you learn to program with it and do analysis on it? I think it's worth just getting in touch with the right educational resources, and I mentioned some before. R for Data Science is a terrific introduction if you're interested in learning R. My own book, Text Mining with R, is great if you want to analyze text data. And I really do recommend DataCamp; I think it's a great way to build Python and R skills towards analyzing your own data sets. But it is important that when you're learning, you have a goal that you're interested in working towards yourself. Another really important educational resource for anyone is their network: finding people who are already data scientists, or who are also interested in building their skills in data science, and learning from them and sharing resources with them. I really recommend the #rstats hashtag on Twitter. I tweet a lot there, and I think it's a really friendly and welcoming group. If you go in there and ask for advice, people often will help, or if you share work you've done, people will share it themselves.
You can also go to meetups in your local city. One global organization that organizes a lot of terrific meetups is R-Ladies, especially for women who work in R. It's a really great way to look for your local chapter: there's one in New York, in London, in Istanbul; a lot of cities have these meetups, and it's worth going and meeting some people and building this network. It'll help you learn and build your skills faster. And if you're interested in being a professional data scientist, I especially recommend starting a blog. That takes a lot of time, but it's really worth it. There are a lot of things that you can share in a blog to build your skills. One thing you can do is find an interesting data set, analyze it, and share the results. That lets you improve your skills not just at analyzing the data, but at communicating the results. If you're more experienced in statistics or mathematics, you can teach a statistical concept in a blog post. That's something I've built a few posts around, and I think it's really important to share the kinds of knowledge that you have that not everyone does. And that's actually how you got your current job, you said earlier: essentially explaining a statistical concept, the beta distribution. Exactly, and the thing that astonished me about that was that it's not a particularly advanced or expert statistical concept. It's the kind of thing that would be introduced in an introductory probability course. So it's not that it was such an advanced concept; it's not the thing I was most expert in, or that very few people knew. It's just something that I knew, that professionals knew, but a lot of the world hadn't been exposed to, and people like the software engineer who found it, and ended up contacting me about the job, were able to learn from it. Fantastic. I have a blog post on Variance Explained about this.
It's called "Advice to aspiring data scientists: start a blog", where I share a lot of my arguments for this, and I also make an offer. I say: if you're interested in breaking into data science and you write your first blog post, you should tweet it at me, at @drob, and I'll share it with my followers. I have a good number of followers, and also, importantly, a lot of them are data scientists who are really interested in spreading the work of people newer to the field. I love this idea, because writing a blog post really forces you to go from collecting data, through the exploratory data analysis, through to either modeling or statistical inference, whatever you do, to visualization, and actually communicating your results. So it takes you through all the steps that are involved in data science. Right, exactly, and it's not just that once it's out in the world it can help you get a job; the entire process of constructing it is like a data science project in microcosm. And actually, recently you shared a blog post from a Metis boot camp student, right? Yes, that's right. Someone was in a data science boot camp called Metis, and they'd heard about my offer and shared their article with me. It was a really terrific data analysis, where they looked at comments on the FCC's website about net neutrality and discovered that more than 1.3 million comments against net neutrality were faked. They were all almost the same comment, with just a few words changed; they were created by a bot. And by detecting that, he was able to say that, in fact, while all the comments on the page looked like a mix of positive and negative, the real ones, made by real people, were overwhelmingly positive. So he shared this post, and I tweeted about it, and it ended up getting a lot of attention.
So this is someone very early in their career, Jeff Kao, and he got interviewed by The Washington Post, and the analysis was shared in all these publications, and it might have a real effect on the future of net neutrality in the country. So I think it's another really terrific example of someone relatively early in their data science career having an enormous impact. And I think it's probably worth getting slightly more technical with respect to how we blog, especially for aspiring data scientists, and discuss how prominent Jupyter notebooks and R Markdown now are for writing computational narratives, where you can explore and explain your entire data analysis, and then create blog posts from that. I think you use Jekyll for your blog, is that right? I do, yes, and it's built with knitr. So knitr is the way I can write a notebook where I intersperse text with code, and it generates the post for me, so that I can show my text, my code and my figures all at once. I think notebooks are a brilliant data science technology, because they can be used to share results within a company, they can be used to share results with the world, and they can be used even for work that you don't plan to share, so that in the future you can understand how it was executed. Yeah, it's great how notebook technology has led in kind of a straight line from there to blogging. Our authoring format on the DataCamp community is entirely through notebooks: you push your notebook to a GitHub repository in order to create a post on the DataCamp community section, which is very cool and a lot of fun. So, we've talked about a lot of different data science techniques and methodologies. What's one of your favorites, or something you just love doing when doing data science? What really gets you going? So, this is a simple technique, but it's one that I think is really underrated, and it's really kind of one of my favorites.
It's: learn to put something on a log scale. That is, take it from a scale where the numbers go one, two, three, four, five, six, and think instead of a scale that goes one, ten, a hundred, a thousand.
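A small numeric sketch of why this helps, with GDP-per-capita values invented for illustration: values spanning several orders of magnitude become evenly spaced once you take base-10 logarithms.

```python
import math

# Invented GDP-per-capita values spanning several orders of magnitude.
gdp_per_capita = [300, 3_000, 30_000]

# On a linear axis, 300 and 3,000 are squashed into the bottom tenth
# of the range; on a log scale, each step is one order of magnitude.
logs = [math.log10(x) for x in gdp_per_capita]
print(logs)  # each value is exactly 1.0 greater than the previous
```

The even spacing on the log scale is what keeps the lower values from being crammed into one small corner of a graph.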

That's really important when graphing, because so many sets of numbers that we work with in the real world exist on scales spanning multiple orders of magnitude. An example would be if I look at the GDP per capita of each country: some countries have a GDP per capita in the hundreds and some in the tens of thousands. If I try to plot that on a scale that goes one thousand, two thousand, three thousand, it's going to cram a lot of interesting and important data into one small part of the graph. But once you turn it into a log scale, it ends up being much more meaningful. So I find there are just so many visualizations and so many statistical methods that become more usable once you get used to thinking about data, sometimes, on a log scale, and switching back and forth when it's useful. That was incredible. I don't think I've ever heard someone explain so well why the log scale is so important, let alone without the ability to actually visualize it. Doing that on a podcast is a feat of wonderful data science education, I think. So, we've discussed a lot about modern data science. What does the future of data science look like to you? What's going to happen in the next two, five, ten years? So I think it's a really exciting time to be a data scientist; there are a lot of new tools being developed. For me, the really interesting question is that there's a convergence between tools that make data science tasks easier without programming and tools that make programming easier. Examples of the former, tools that don't require programming, include Tableau and Looker and Periscope. These tools are very good, but they're sometimes meant to automate away your data scientist; they like to say you don't need a data scientist anymore, you'll have this tool, or you can use this without programming. And those tools, I think, are getting better and better. But from the other side, programming is becoming more and more accessible. I think the tidyverse is doing terrific things for that in R, and pandas is doing terrific things for that in Python. So these two are converging, and the question is going to be: are the non-programming tools going to get so good that no one needs to program, or are the programming tools going to get so intuitive that everyone will program? And I think in the long term, programming will win. I think programming as part of data analysis will become a typical and valuable business skill across many fields, just like writing or public speaking. Not everyone writes as part of their job, but it is considered an important part of so many jobs: any CEO would be expected to do some writing, and a lot of people in software engineering are expected to be able to communicate effectively. And I think that these data literacy skills, as programming becomes more and more accessible, are going to become more and more widespread. So I'd love to see a future where, in ten or more years, tools like dplyr and ggplot2, or whatever's next in that space, become really, really widespread across a wide variety of fields, beyond professional data scientists. It is always tough to say goodbye to you, Dave, because I always enjoy our conversations so very much, and we could wax lyrical for hours on such things. But in closing, I'd like to know if you have any final call to action for our listeners. Yeah, so I've provided some advice here, and I'd definitely say: read about data science. I'd recommend the book R for Data Science by Hadley Wickham, and certainly recommend checking out my book with Julia Silge, Text Mining with R: A Tidy Approach. I'd recommend that people blog about data science, and if you do, you should tweet me about it; that's @drob on Twitter, and that will let me share it and have more people read it. And I'd recommend people take my DataCamp course, Introduction to the Tidyverse. It's a great introduction to R for people who haven't programmed before; it's great for people who want to be introduced to the dplyr and ggplot2 packages, or who have maybe used dplyr and ggplot2 and want to see how they fit together. So I recommend checking out the course. And I'll add to that that it's a great introduction to computational ways of dealing with data, exploring a computational way to mimic how we cognitively think about data and the patterns that emerge there. Yeah, I really like that. I think it gives a real taste of what a day in the life of a data scientist is like. Absolutely. Dave, it's been an absolute pleasure having you on the show. Thank you. Thanks so much, Hugo, I had a great time. Thanks for joining everybody in this conversation with Dave Robinson. We heard about the increasing importance of data literacy in society, and how everybody out there can take action to become more data literate, whether it be by working on small projects that interest you, reaching out to the data science community at large for help and advice, or getting started by writing blog posts about your very own projects. Make sure to tune in for the next episode of DataFramed, where I'll be chatting with Robert Chang about data science at Airbnb and Twitter.
