- Software and the physical world
- Three organizations pressing for change in society’s approach to computing
- Four short links: 16 May 2013
- Four short links: 15 May 2013
- Four short links: 14 May 2013
- Big data, cool kids
- Four short links: 13 May 2013
- Four short links: 10 May 2013
- Yet another Kickstarter: Otherlab's Home Milling Machine
- Where will software and hardware meet?
- Four short links: 9 May 2013
- Four short links: 8 May 2013
- Steering the ship that is data science
- Four short links: 7 May 2013
- Another Serving of Data Skepticism
In this episode of the Radar podcast series, Jon Bruner and I are joined by Mike Loukides as we muse more on software and the physical world. No coffee shop clatter in the background this time around as we were forced by geography and time to talk on the phone, but I still managed to have a good cup from my favorite local cafe in my hand. In the course of our conversation, we discovered that Mike drinks tea, so this may be his last appearance. Our discussion ranges from the declining cost of 3D printing to ham radio antenna design. Along the way, we touch on the ease with which data scientists can build data sensing motes with open source and open hardware components. We hope you enjoy listening as much as we enjoyed talking.
Taking advantage of a recent trip to Washington, DC, I had the privilege of visiting three non-profit organizations that are leaders in applying computing to social change. First, I attended the annual meeting of the Association for Computing Machinery's US Public Policy Council (USACM). Several members of the council then visited the Open Technology Institute (OTI), which is a section of the New America Foundation (NAF). Finally, I caught the end of the first general-attendance meeting of the Open Source Initiative (OSI).

In different ways, these organizations are all putting in tremendous effort to provide the benefits of computing to people from all walks of life and to preserve the vigor and creativity of computing platforms. Through my meetings, I found out what sort of systemic change is required to achieve these goals and saw these organizations grapple with a variety of strategies to get there. This report is not a statement from any of these groups, just my personal observations.

USACM PUBLIC POLICY COUNCIL

The Association for Computing Machinery (ACM) has been around almost as long as electronic computers: it was founded in 1948. I joined in the 1980s but was mostly a passive recipient of information. Although last week's policy meeting was the first I had ever attended at ACM, many of the attendees were familiar to me from previous work on either technical or policy problems.

As we met, open data was in the news thanks to a White House memo reiterating its call for open government data in machine-readable form. Although the movement for these data sets has been congealing from many directions over the past few years, USACM was out in front back in early 2009 with a policy recommendation for consumable data.

USACM weighs in on such policy issues as copyrights and patents, accessibility, privacy, innovation, and the various other topics on which you'd expect computer scientists to have professional opinions. I felt that the group's domestic scope is oddly out of sync with the larger ACM, which has been assiduously expanding overseas for many years. A majority of ACM members now live outside the United States. In fact, many of today's issues have international reach: cybersecurity, accessibility, and copyright, to name some obvious ones. Although USACM has submitted comments on ACTA and the Trans-Pacific Partnership, they don't maintain regular working contacts with organizations outside the country. Perhaps they'll have the cycles to add more international connections in the future. Eugene Spafford, security expert and current chair of the policy committee, pointed out that many state-level projects in the US would be worth commenting on as well.

It's also time to recognize that policy is made by non-governmental organizations as well as governments. Facebook and Google, for example, are setting policies about privacy. The book _The End of Power: From Boardrooms to Battlefields and Churches to States, Why Being In Charge Isn't What It Used To Be_ by Moisés Naím claims that power is becoming more widely distributed (not ending, really) and that a bigger set of actors should be taken into account by people hoping to effect change.

USACM represents a technical organization, so it seeks to educate policy decision-makers on issues at the intersection of computing technology and public policy. Its principles derive from the ACM Code of Ethics and Professional Conduct, which evolved from input by many ACM members and the organization's experience.
USACM papers usually focus on pointing out the technical implications of legislative or regulatory choices. When the notorious SOPA and PIPA bills came up, for instance, USACM didn't issue the kind of blanket condemnation many other groups put out, supported by appeals to vague concepts such as freedom and innovation. Instead, they put the microscope to the bills' provisions and issued brief comments about negative effects on the functioning of the Internet, with a focus on DNS. Spafford commented, "We also met privately with Congressional staff and provided tutorials on how DNS and similar mechanisms worked. That helped them understand why their proposals would fail."

OPEN TECHNOLOGY INSTITUTE

NAF is a flexible and innovative think tank proposing new strategies for dozens of national and international issues. Mostly progressive, in my view, it is committed to considering a wide range of possible solutions and finding rational approaches that all sides can accept. On computing and Internet issues, it features the Open Technology Institute, a rare example of a non-profit group that is firmly engaged in both technology production and policy-making. This reflects the multi-disciplinary expertise of OTI director Sascha Meinrath. Known best for advocating strong policies to promote high-bandwidth Internet access, the OTI also concerns itself with the familiar range of policies in copyright, patents, privacy, and security.

Google executive chairman Eric Schmidt is chair of the NAF board, and Google has been generous in its donations to NAF, including storage for the massive amounts of data the OTI has collected on bandwidth worldwide through its Measurement Lab, or M-Lab. M-Lab measures Internet traffic around the world, using crowdsourcing to produce realistic reports about bandwidth, chokepoints, and other aspects of traffic. People can download the M-Lab tools to check for traffic shaping by providers and other characteristics of their connections, and send results back to M-Lab for storage. (They now have 700 terabytes of such data.) Other sites offer speed testing for uploads and downloads, but M-Lab is unique in storing and visualizing the results. The FCC, among others, has used M-Lab to determine the uneven progress of bandwidth in different regions. Like all OTI software projects, Measurement Lab is open source.

OPEN SOURCE INITIATIVE

For my last meeting of the day, I dropped by for the last few sessions of the Open Source Initiative's Open Source Community Summit and talked to Deborah Bryant, Simon Phipps, and Bruno Souza. OSI's recent changes represent yet another strategy for pushing change in the computer field. OSI is best known for approving open source licenses and seems to be universally recognized as an honest and dependable judge in that area, but they want to branch out from this narrow task. About a year ago, they completely revamped their structure and redefined themselves as a membership organization. (I plunked down some cash as one of their first individual members, having heard of the change from Simon at a Community Leadership Summit.)

When they announced the summit, they opened up a wiki for discussion about what to cover. The winner, hands down, was an all-day workshop on licensing -- I guess you can tell when you're in Washington. (The location was the Library of Congress.) They also held an unconference that attracted a nice mix of open-source and proprietary software companies, software users, and government workers.
I heard working group summaries that covered such basic advice as getting ownership of the code you contract out to companies to create for you, using open source to attract and retain staff, and justifying the investment in open source by thinking more broadly than the agency's immediate needs and priorities. Organizers used the conference to roll out Working Groups, a new procedure for starting projects. Two such projects, launched by members, are the development of a FAQ and the creation of a speakers bureau. Anybody with an idea that fits the mission of promoting and adopting open source software can propose a project, but the process requires strict deadlines and plans for fiscally sustaining the project.

OSI is trying to change government and society by changing the way they make and consume software. USACM is trying to improve institutions' understanding of software, as well as the environment in which it is made. NAF is trying to extend computing to everyone, using software as a research tool in pursuit of that goal. Each organization, starting from a different place, is expanding its options and changing itself in order to change others.
* Australian Filter Scope Creep -- _The Federal Government has confirmed its financial regulator has started requiring Australian Internet service providers to block websites suspected of providing fraudulent financial opportunities, in a move which appears to also open the door for other government agencies to unilaterally block sites they deem questionable in their own portfolios._
* Embedding Actions in Gmail -- after years of benign neglect, it's good to see Gmail worked on again. We've said for years that email's a fertile ground for doing stuff better, and Google seem to have the religion. (See Send Money with Gmail for more.)
* What Keeps Me Up at Night (Matt Webb) -- Matt's building a business around connected devices. Here he explains why the category could be owned by any of the big players. In times like this I remember Howard Aiken's advice: _Don't worry about people stealing your ideas. If it is original you will have to ram it down their throats._
* Image Texture Predicts Avian Density and Species Richness (PLOS ONE) -- _Surprisingly and interestingly, remotely sensed vegetation structure measures (i.e., image texture) were often better predictors of avian density and species richness than field-measured vegetation structure, and thus show promise as a valuable tool for mapping habitat quality and characterizing biodiversity across broad areas._
* Facial Recognition in Google Glass (Mashable) -- this makes Glass umpty times more attractive to me. It was created in a hackathon for doctors to use with patients, but I need it wired into my eyeballs.
* How to Price Your Hardware Project -- _At the end of the day you are picking a price that enables you to stay in business. As @meganauman says, "Profit is not something to add at the end, it is something to plan for in the beginning."_
* Hardware Pricing (Matt Webb) -- _When products connect to the cloud, the cost structure changes once again. On the one hand, there are ongoing network costs which have to be paid by someone. You can do that with a cut of transactions on the platform, by absorbing the network cost upfront in the RRP, or with user-pays subscription._
* Dicoogle -- open source medical image search. Written up in a PLOS ONE paper.
* Behind the Banner -- visualization of what happens in the 150ms when the cabal of data vultures decide which ad to show you. They pass around your data as enthusiastically as a pipe at a Grateful Dead concert, and you've just as much chance of getting it back. (via John Battelle)
* pwnpad -- Nexus 7 with Android and Ubuntu, high-gain USB Bluetooth, Ethernet adapter, and a gorgeous suite of security tools. (via Kyle Young)
* Terra -- _a simple, statically-typed, compiled language with manual memory management [...] designed from the beginning to interoperate with Lua. Terra functions are first-class Lua values created using the terra keyword. When needed they are JIT-compiled to machine code._ (via Hacker News)
* Metaphor Identification in Large Texts Corpora (PLOS ONE) -- _The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms' performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus._
My data's bigger than yours!

The big data world is a confusing place. We're no longer in a market dominated mostly by relational databases, and the alternatives have multiplied in a baby boom of diversity. These child prodigies of the data scene show great promise but spend a lot of time knocking each other around in the schoolyard. Their egos can sometimes be too big to accept that everybody has their place, and eyeball-seeking media certainly doesn't help.
POPULAR KID: Look at me! Big data is the hotness!
HADOOP: My data's bigger than yours!
SCIPY: Size isn't everything, Hadoop! The bigger they come, the harder they fall. And aren't you named after a toy elephant?
R: Backward sentences mine be, but great power contains large brain.
EVERYONE: Huh?
SQL: Oh, so you all want to be friends again now, eh?!
POPULAR KID: Yeah, what SQL said! Nobody really needs big data; it's all about small data, dummy.

The fact is that we're fumbling toward the adolescence of big data tools, and we're at an early stage of understanding how data can be used to create value and increase the quality of service people receive from government, business, and health care. Big data is trumpeted in mainstream media, but many businesses are better advised to take baby steps with small data.

Data skeptics are not without justification. Our use of "small data" hasn't exactly worked out uniformly well so far, crude numbers often being misused either knowingly or otherwise. For example, over-reliance by bureaucrats on the results of testing in schools is shaping educational institutions toward a tragically homogeneous mediocrity.

The promise and the gamble of big data is this: that we can advance past the primitive quotas of today's small data into both a sophisticated statistical understanding of an entire system and insight that focuses down to the level of an individual. Data gives us both telescope and microscope, in detail we've never had before. Inside this tantalizing vision lie many of the debates in today's data world: the need for highly skilled data scientists to effect this change, and the worry that we'll inadvertently enslave ourselves to Big Brother, even with the best of intentions.

So, as the data revolution moves forward, it's important to take the long view. The ferment of tools and job titles and algorithms is significant, but ultimately it's background to our larger purposes as people, businesses, and government. That's one reason why, at O'Reilly, we've taken the motto "Making Data Work" for Strata. Data, not technology, is the heartbeat of our world because it relates directly to ourselves and the problems we want to solve. This is also the reason that the Strata and Hadoop World conferences take a broad view of the subject, ranging from business topics to tools and data science. If you talk to Hadoop's most seasoned advocates, they don't speak only about the tech; they talk about the problems they're able to solve. The tools alone are never enough; the real enabler is the framework of people and understanding in which they're used.

Our mission is to help people make sense of the state of the data world and use this knowledge to become both more competitive and more creative. We believe that's best served by creating context in which we think about our use of data, as well as by serving the growing specialist communities in data. Enjoy the noise and the energy from the growing data ecosystem, but keep your eyes on the problems you want to solve. _The Strata and Hadoop World Call for Proposals is open until midnight EDT, Thursday, May 16._
* Exploiting a Bug in Google Glass -- unbelievably detailed and yet easy-to-follow explanation of how the bug works, how the author found it, and how you can exploit it too. _The second guide was slightly more technical, so when he returned a little later I asked him about the Debug Mode option. The reaction was interesting: he kind of looked at me, somewhat confused, and asked "wait, what version of the software does it report in Settings?" When I told him "XE4" he clarified "XE4, not XE3," which I verified. He had thought this feature had been removed from the production units._
* Probability Through Problems -- motivating problems to hook students on probability questions, structured to cover high-school probability material.
* Connbox -- love the section "The importance of legible products," where the physical UI interacts seamlessly with the digital device … it's glorious. Three amazing videos.
* The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees (PLOS ONE) -- _The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks. [...] An implementation of ISMA in Java is freely available._
* The Remixing Dilemma -- summary of research on remixed projects, finding that _(1) Projects with moderate amounts of code are remixed more often than either very simple or very complex projects. (2) Projects by more prominent creators are more generative. (3) Remixes are more likely to attract remixers than de novo projects._
* Scratch 2.0 -- my favourite first programming language for kids and adults, now in the browser! Downloadable version for offline use coming soon. See the overview for what's new.
* State Dept Takedown on 3D-Printed Gun (Forbes) -- _The government says it wants to review the files for compliance with arms export control laws known as the International Traffic in Arms Regulations, or ITAR. By uploading the weapons files to the Internet and allowing them to be downloaded abroad, the letter implies Wilson's high-tech gun group may have violated those export controls._
* Data Science of the Facebook World (Stephen Wolfram) -- _More than a million people have now used our Wolfram|Alpha Personal Analytics for Facebook. And as part of our latest update, in addition to collecting some anonymized statistics, we launched a Data Donor program that allows people to contribute detailed data to us for research purposes. A few weeks ago we decided to start analyzing all this data…_ (via Phil Earnhardt)
If you have a good memory, you know that I've written about 3D printers. Technically, I grew up with the laser printer; my first computer industry job (part-time while getting an English PhD) was with Imagen, a startup that built the first laser printer that cost under $20,000, then the first that cost under $10,000, then under $7,000, and died a slow death after Apple produced the first that cost under $5,000. Now a laser printer costs a few hundred. And I've been cheering as 3D printers followed the same price curve.

But even as I've been cheering, I've had this nagging doubt in the back of my head. So I can 3D-print my own chess set. Cool. So what? Sure, you can do great things with them (enclosures for projects; every DIY-bio lab I've visited has a 3D printer stashed somewhere). While a 3D printer is an important step in bringing 21st-century tooling to the home hacker, they're still fairly limited.

Last night, the other shoe dropped. Otherfab, a project of Saul Griffith's Otherlab, has a new Kickstarter project for Othermill: a home computer-controlled milling machine. A milling machine is a large, versatile beast that uses a high-speed cutting bit to sculpt material (often metal) into the desired shape. Instead of adding layers of plastic or some other material, like a 3D printer, a milling machine cuts material away. If you've ever visited machine shops, you know that milling machines are where the magic happens -- particularly state-of-the-art, computer-controlled mills. They're big, they're expensive, and they can do just about anything. Putting one in the home shop -- that's revolutionary. A printer combined with a mill (additive and subtractive processes): that's an exciting combination.

Otherfab's mill is intended for making custom printed circuit boards; in a home environment, cutting away unneeded copper is much preferable to using acids to etch boards (I've made my own boards, and I know), and it gives you more immediate feedback than sending a design off to a fabrication facility. But I don't really think this is about PC boards and electronics. As the Kickstarter points out, their mill can be used to make anything that fits: it cuts metal, wood, wax, and plastic. I can't wait to see what people use it for. And if we're going to get serious about reinventing and re-envisioning manufacturing, home milling machines are essential infrastructure.

Othermill reached funding in under 24 hours; they have stretch goals ranging from $100K (already passed) up to a million. It looks like, when they have a commercially available unit, the price will be somewhere around $1,500, though I'm just guessing; I'd also guess that the price will continue to drop, as it did with 3D printers. I don't know, writing about Kickstarters could end up being too much fun.
I'm a sucker for a good plant tour, and I had a really good one last week when Jim Stogdill and I visited K. Venkatesh Prasad at Ford Motor in Dearborn, Mich. I gave a seminar and we talked at length about Ford's OpenXC program and its approach to building software platforms.

The highlight of the visit was seeing the scale of Ford's operation, and particularly the scale of its research and development organization. Prasad's building is a half-mile into Ford's vast research and engineering campus. It's an endless grid of wet labs like you'd see at a university: test tubes and robots all over the place; separate labs for adhesives, textiles, and vibration dampening; machines for evaluating what's in reach for different-sized people. Prasad explained that much of the R&D that goes into a car is conducted at suppliers (Ford might ask its steel supplier to come up with a lighter, stronger alloy, for instance), but Ford is responsible for integrative research: figuring out how to, say, bond its foam insulation onto that new alloy.

In our more fevered moments, we on the software side of things tend to foresee every problem being reduced to a generic software problem, solvable with brute-force computing and standard machinery. In that interpretation, a theoretical Google car operating system (one that would drive the car and provide Web-based services to passengers) could commoditize the mechanical aspects of the automobile. If you're not driving, you don't care much about how the car handles; you just want a comfortable seat, functional air conditioning, and Web connectivity for entertainment. A panel in the dashboard becomes the only substantive point of interaction between a car and its owner, and if every car is running Google's software in that panel, then there's not much left to distinguish different makes and models. When's the last time you heard much of a debate on Dell laptops versus HP? As long as it's running the software you want, and meets minimum criteria for performance and physical quality, there's not much to distinguish laptop makers for the vast majority of users. The exception, perhaps, is Apple, which consumers do distinguish from other laptop makers for both its high-quality hardware and its unique software.

That's how I start to think after a few days in Mountain View. A trip to Detroit pushes me in the other direction: the mechanical aspects of cars are enormously complex. Even incremental changes take vast re-engineering efforts. Changing the shape of a door sill to make a car easier to get into means changing a car's aesthetics, its frame, the sheet metal that gets stamped to make it, the wires and sensors embedded in it, and the assembly process that puts it together. Everything from structural integrity to user experience needs to be carefully checked before a thousand replicas start driving out of Ford's plants every day.

So, when it comes to value added, where will the balance between software and machines emerge? Software companies and industrial firms might both try to shift the balance by controlling the interfaces between software and machines: if OpenXC can demonstrate that it's a better way to interact with Ford cars than any other interface, Ford will retain an advantage. As physical things get networked and instrumented, software can make up a larger proportion of their value. I'm not sure exactly where that balance will arise, but I have a hard time believing in complete commoditization of the machines beneath the software.
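To make the interface question a little more concrete, here's a minimal sketch of what consuming that kind of interface can look like. It assumes the newline-delimited JSON trace format the OpenXC project documents for its vehicle interface (simple messages with "name" and "value" fields); the file name, signal name, and helper function below are hypothetical illustrations, not Ford's definitive API:

```python
import json

def average_signal(trace_path, signal_name):
    """Average one named signal from an OpenXC-style JSON trace file.

    Assumes one JSON message per line, e.g.:
        {"timestamp": 1385133351.28, "name": "vehicle_speed", "value": 72.5}
    """
    total, count = 0.0, 0
    with open(trace_path) as trace:
        for line in trace:
            line = line.strip()
            if not line:
                continue
            message = json.loads(line)
            if message.get("name") == signal_name:
                total += float(message["value"])
                count += 1
    return total / count if count else None

# Hypothetical usage: "drive.json" is a trace captured from a car.
if __name__ == "__main__":
    print(average_signal("drive.json", "vehicle_speed"))
```

The point is less the code than the shape of it: a plain, documented stream like this is what turns a car from a sealed machine into a platform that outside developers can build on.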
_See our free research report on the industrial internet for an overview of the ways that software and machines are coming together._
* On Google's Ingress Game (ReadWrite Web) -- _By rolling out Ingress to developers at I/O, Google hopes to show how mobile, location, multi-player and augmented reality functions can be integrated into developer application offerings. In that way, Ingress becomes a kind of "how-to" template to developers looking to create vibrant new offerings for Android games and apps._ (via Mike Loukides)
* Nanoscribe Micro-3D Printer -- _in contrast to stereolithography (SLA), the resolution is between 1 and 2 orders of magnitude higher: Feature sizes in the order of 1 µm and less are standard._ (via BoingBoing)
* Thingpunk -- _The problem of the persistence of these traditional values is that they prevent us from addressing the most pressing design questions of the digital era: How can we create these forms of beauty and fulfill this promise of authenticity within the large and growing portions of our lives that are lived digitally? Or, conversely, can we learn to move past these older ideas of value, to embrace the transience and changeability offered by the digital as virtues in themselves? Thus far, instead of approaching these (extremely difficult) questions directly, traditional design thinking has led us to avoid them by trying to make our digital things more like physical things (building in artificial scarcity, designing them skeuomorphically, etc.) and by treating the digital as a supplemental add-on to primarily physical devices and experiences (the Internet of Things, digital fabrication)._
* Kickstarter and NPR -- _The internet turns everything into public radio._ There's a truth here about audience-supported media and the kinds of money-extraction systems necessary to beat freeloading in a medium that makes money-collection hard and freeloading easy.
* How to Build a Working Digital Computer Out of Paperclips (Evil Mad Scientist) -- from a 1967 popular science book showing _how to build everything from parts that you might find at a hardware store: items like paper clips, little light bulbs, thread spools, wire, screws, and switches (that can optionally be made from paper clips)._
* Moloch (GitHub) -- _an open source, large scale IPv4 packet capturing (PCAP), indexing and database system_ with a simple web GUI.
* Offline Wikipedia Reader (Amazon) -- genius, because what Wikipedia needed to be successful was to be read-only. (via BoingBoing)
* Storing and Publishing Sensor Data -- rundown of apps and sites for sensor data. (via Pete Warden)
Mike Loukides recently recapped a conversation we'd had about leading indicators for data science efforts in an organization. We also pondered where the role of data scientist is headed and realized we could treat software development as a prototype case.

It's easy (if not eerie) to draw parallels between the Internet boom of the mid-1990s and the big data boom of the present day: in addition to the exuberance in the press and the new business models, a particular breed of technical skill became a competitive advantage and a household name. Back then, this was the software developer. Today, it's the data scientist.

The time in the sun improved software development in some ways, but it also brought its share of problems. Some companies were short on the skill and discipline required to manage custom software projects, and they were equally ill-equipped to discern the true technical talent from the pretenders. That combination led to low-quality software projects that simply failed to deliver business value. (A number of these survive today as "repair-ware" that requires constant, expensive upkeep.)

How will the data science field avoid software development's pitfalls? (As an aside, we shudder to think what the data science equivalent of "repair-ware" would be.) We started to explore some ideas but realized they were all rooted in education, business value, and openness:

Company leaders must educate themselves in order to understand how data analysis can improve their firm. That knowledge will guide them in building out a data science team and establishing its mission. Leaders otherwise risk trivializing the data scientist role or overindulging in analytics for the sake of analytics.

In turn, data scientists must understand how their work is meant to improve the business. That will serve as their compass when they explore new ideas so they can aim to deliver solid value. Without that guidance, it's too easy to get stuck in rabbit-holes and yak-shaving.

Both parties must watch for needless barriers forming around the data science team, especially after the initial novelty fades. Open communication between the data science group and the rest of the business will ensure the former doesn't land in a separate silo, marginalized out of the company mission.

These ideas may be a start, but are they enough? Probably not. What would you recommend to steer data science clear of pitfalls?
* Raspberry Pi Wireless Attack Toolkit -- _A collection of pre-configured or automatically-configured tools that automate and ease the process of creating robust man-in-the-middle attacks. The toolkit allows you to easily select between several attack modes and is specifically designed to be easily extendable with custom payloads, tools, and attacks. The cornerstone of this project is the ability to inject Browser Exploitation Framework hooks into a web browser without any warnings, alarms, or alerts to the user. We accomplish this objective mainly through wireless attacks, but also have a limpet mine mode with ettercap and a few other tricks._
* Industrial Robot with SDK for Researchers (IEEE Spectrum) -- $22,000 industrial robot with 7-degrees-of-freedom arms, _integrated cameras, sonar, and torque sensors on every joint. [...] The Baxter research version is still running a core software system that is proprietary, not open. But on top of that the company built the SDK layer, based on ROS (Robot Operating System), and this layer is open source. In addition, there are also some libraries of low-level tasks (such as joint control and positioning) that Rethink made open._
* OtherMill (Kickstarter) -- _An easy to use, affordable, computer controlled mill. Take all your DIY projects further with custom circuits and precision machining._ (via Mike Loukides)
* go-raft (GitHub) -- open source implementation of the Raft distributed consensus protocol, in Go. (via Ian Davis)
I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you're in the New York area, and you're interested in seeing data used honestly, stop by!

That announcement pushed me to write another post about data skepticism. The past few days, I've seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I'm worried. (And I'm not vain enough to think it's a response to my first post about skepticism; it's more likely an effect of Cukier's book.) There's a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can't tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that's subject to entropy.

Let's do some thought experiments, unfortunately totally devoid of data. But I don't think we need data to get to the core of the problem. Think of the classic false correlation (also used, when teaching logic, as an example of a false syllogism): there's a strong correlation between people who eat pickles and people who die. Well, yeah. We laugh. But let's take this a step further: correlation is a double-headed arrow. So not only does this poor logic imply that we can reduce the death rate by preventing people from eating pickles, it also implies that we can harm the chemical companies that produce vinegar by preventing people from dying. And here we see what's really happening: to remove one head of the double-headed arrow, we use "common sense" to choose between two stories: one that's merely silly, and another that's so ludicrous we never even think about it. That seems to work here (for a very limited value of "work"); but if I've learned one thing, it's that good old common sense is frequently neither common nor sensible. For more realistic correlations, it certainly seems ironic that we're doing all this data analysis just to end up relying on common sense.

Now let's look at something equally hypothetical that isn't silly. A drug is correlated with reduced risk of death due to heart failure. Good thing, right? Yes, but why? What if the drug has nothing to do with heart failure, but is really an anti-depressant that makes you feel better about yourself so you exercise more? If you're in the "correlation is as good as causation" club, it doesn't make a difference: you win either way. Except that, if the key is really exercise, there might be much better ways to achieve the same result. Certainly much cheaper, since the drug industry will no doubt price the pills at $100 each. (Tangent: I once saw a truck drive up to an orthopedist's office and deliver Vioxx samples with a street value probably in the millions…) It's possible, given some really interesting work being done on the placebo effect, that a properly administered sugar pill will make the patient feel better and exercise, yielding the same result. (Though it's possible that sugar pills only work as placebos if they're expensive.) I think we'd like to know, rather than just saying that correlation is as good as causation if you have a lot of data.

Perhaps I haven't gone far enough: with enough data, and enough dimensions to the data, it would be possible to detect the correlations between the drug, psychological state, exercise, and heart disease. But that's not the point. First, if correlation really is as good as causation, why bother? Second, to analyze data, you have to collect it. And before you collect it, you have to decide what to collect.
Data is socially constructed (I promise, this will be the subject of another post), and the data you don't decide to collect doesn't exist. Decisions about what data to collect are almost always driven by the stories we want to tell. You can have petabytes of data, but if it isn't the right data, if it's data that's been biased by preconceived notions of what's important, you're going to be misled. Indeed, any researcher knows that huge data sets tend to create spurious correlations.

Causation has its own problems, not the least of which is that it's impossible to prove. Unfortunately, that's the way the world works. But thinking about cause, and how events relate to each other, helps us to be more critical about the correlations we discover. As humans we're storytellers, and an important part of data work is building a story around the data. Mere correlations arising from a gigantic pool of data aren't enough to satisfy us. But there are good stories and bad ones, and just as it's possible to be careful in designing your experiments, it's possible to be careful and ethical in the stories you tell with your data. Those stories may be the closest we ever get to an understanding of cause; but we have to realize that they're just stories, that they're provisional, and that better evidence (which may just be correlations) may force us to retell our stories at any moment.

"Correlation is as good as causation" is just an excuse for intellectual sloppiness; it's an excuse to replace thought with an odd kind of "common sense," and to shut down the discussion that leads to good stories and understanding.
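As a footnote to the point about huge data sets and spurious correlations, here's a minimal sketch (NumPy, with made-up sample and dimension counts) showing how a high-dimensional pile of pure noise reliably yields an impressive-looking correlation:

```python
import numpy as np

# A pile of mutually independent "measurements": nothing causes anything.
rng = np.random.default_rng(42)
n_samples = 100     # observations per variable
n_variables = 1000  # dimensions in the data set

data = rng.normal(size=(n_variables, n_samples))

# All pairwise correlations between the 1,000 variables.
corr = np.corrcoef(data)
np.fill_diagonal(corr, 0)  # ignore each variable's correlation with itself

# The strongest "relationship" found by brute-force search.
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"variables {i} and {j}: r = {corr[i, j]:.2f}")
# With roughly 500,000 pairs of pure-noise variables, |r| around 0.5 is
# typical here -- a "finding" guaranteed to evaporate on fresh data.
```

Nothing in that data causes anything else, yet the strongest "relationship" looks publishable, which is exactly why a correlation pulled from a big enough data set needs a story, and a test, before it deserves belief.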