From counting steps with a smartwatch to visiting this website, nearly everything we do generates data. But simply collecting and storing statistics, measurements and other numbers is not enough. How we harness data is the key to success in our digital world.
What Is Data Analysis and Why Is It Necessary?
How many steps you took today doesn’t mean anything on its own. You need context: how many steps you took yesterday, how many you take on average and how many you should be taking.
When you gather information, organize it and draw conclusions and insights, you can make better decisions, improve operations, fine-tune technology and so on. Data analysis involves evaluating and summarizing information, and it can help us understand patterns and trends.
Types of Data Analysis
There are four main types of data analysis: descriptive, diagnostic, predictive and prescriptive. These data analysis methods build on each other like tiers of a wedding cake.
Descriptive Data Analysis
Descriptive statistics tell you what is in the data you’ve gathered. Building blocks include how many data points you have, average and median measurements, the amount of variation within your data, and the certainty those things provide about your results.
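Those descriptive building blocks are straightforward to compute. Here is a minimal sketch using Python’s standard statistics module and made-up step counts (the numbers are illustrative, not from the article):

```python
import statistics

# Two weeks of daily step counts (invented numbers for illustration).
steps = [4200, 5100, 6800, 3900, 7200, 8100, 5600,
         4800, 6100, 7400, 3500, 6900, 7800, 5200]

# The descriptive building blocks: count, center, and spread.
print("data points:", len(steps))                      # 14
print("mean:       ", statistics.mean(steps))          # 5900
print("median:     ", statistics.median(steps))        # 5850
print("std dev:    ", round(statistics.stdev(steps), 1))
```

The standard deviation quantifies the variation within the data, which in turn drives how certain you can be about conclusions drawn from it.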
Diagnostic Data Analysis
Diagnostic data analysis, also called causal analysis, examines the relationships among data to uncover possible causes and effects. To accomplish this, you might look for known relationships to explain observations or use data to identify unknown relationships.
Predictive Data Analysis
Building on diagnostic data analysis is predictive analysis, where you use those relationships to generate predictions about future results. These “models” can range from equations in a spreadsheet to applications of artificial intelligence requiring vast computing resources.
Predictive modeling is the heart of analysis, says Nick Street, professor of business analytics and associate dean for research and Ph.D. programs at the University of Iowa’s Tippie College of Business.
“My poll needs to be correct about the people who are going to vote, and my self-driving car has to be correct about whether that’s a stop sign or not,” Street says.
Prescriptive Data Analysis
Often, the goal of data analysis is to help make sound decisions. While all types of data analysis can help you accomplish this, prescriptive data analysis provides a deeper understanding of costs, benefits and risks. Basically, prescriptive data analysis helps us answer the question, “What should I do?”
The most common kind of prescriptive analysis is optimization, or figuring out “the best results under the given circumstances,” according to a post at Data Science Central. So, given a set of constraints, which inputs provide the most benefit for the lowest cost and least amount of risk? For example, a particular step in surgery might reduce the risk of infection but increase the risk of other complications.
In Street’s work, data can inform a decision by predicting how likely a patient is to get an infection without the step in surgery that is supposed to reduce infection risk. That way, a doctor could determine whether the extra step is actually beneficial, or if the step could be removed from the surgical process.
Of course, while a data analyst can provide the prescriptive analysis, a doctor would need to interpret the probability and make a decision based on the data.
“I’m not qualified to make that decision,” Street says of a data analyst’s role. “I can just tell you that for this person it’s (63%).”
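The arithmetic behind a prescriptive comparison like the surgery example can be sketched as an expected-cost calculation. Every probability and cost below is a hypothetical stand-in, not a figure from Street’s work:

```python
# Hypothetical model outputs and costs -- invented numbers for illustration.
p_infection_without_step = 0.10   # predicted infection risk if step is skipped
p_infection_with_step = 0.04      # predicted infection risk if step is done
p_complication_from_step = 0.03   # extra complication risk the step introduces

cost_infection = 50_000           # assumed cost of treating an infection
cost_complication = 80_000        # assumed cost of treating a complication

# Expected cost of each choice = sum of (probability * cost) over outcomes.
expected_cost_skip = p_infection_without_step * cost_infection
expected_cost_do = (p_infection_with_step * cost_infection
                    + p_complication_from_step * cost_complication)

best = "do the step" if expected_cost_do < expected_cost_skip else "skip the step"
print(f"skip: {expected_cost_skip:.0f}, do: {expected_cost_do:.0f} -> {best}")
```

As Street notes, the analysis only supplies the probabilities; a doctor still has to weigh them and decide.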
Data Analysis Tools, Techniques and Methods
Data analysis involves a spectrum of tools and methodologies with overlapping goals, strengths and capabilities. Here is how each working part contributes to effective data analysis.
The Data Analysis Phases
There are different ways of looking at the phases of data analysis. Here is a typical framework.
Define Your Questions
You need to know the questions you want to answer and determine what data you require in order to find the answers.
Collect the Data
This involves identifying data that might answer your questions, determining what steps are required to gather the data, and understanding what strengths and weaknesses each type of data might present. Not all data is strong or relevant for answering your question.
Charlie McHenry, a partner at consulting firm Green Econometrics, says figuring out which data matters to answer a question might seem difficult, but the information you need is often hiding in plain sight.
For example, consider data gathered from business systems, responses to surveys and information downloaded from social media platforms. You might also consider purchasing commercial data or using public datasets.
“Every enterprise has a fire hose of collectable data,” McHenry says.
Clean the Data
This is the most delicate stage of data analysis, and it often takes the most time to accomplish. All data comes in “dirty,” containing errors, omissions and biases. While data doesn’t lie, accurate analysis requires identifying and accounting for imperfections.
For example, lists of people often contain multiple entries with different spellings. The same person might appear with the names Anne, Annie and Ann. At least one of those is misspelled, and treating her as three separate people is always incorrect.
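A crude cleaning pass for that kind of name problem can be sketched with Python’s standard difflib module, whose string-similarity ratio stands in here for real record-linkage tooling (the names and the 0.8 cutoff are illustrative choices):

```python
import difflib

# Raw list containing near-duplicate spellings of the same person.
raw_names = ["Anne Smith", "Annie Smith", "Ann Smith", "Bob Jones"]

# Keep the first spelling seen as canonical; fold close variants into it.
canonical = []
for name in raw_names:
    match = difflib.get_close_matches(name, canonical, n=1, cutoff=0.8)
    if not match:
        canonical.append(name)

print(canonical)  # ['Anne Smith', 'Bob Jones']
```

Real deduplication also checks fields like birth dates or addresses before merging records, since two genuinely different people can have similar names.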
Analyze the Data
The meatiest phase is applying descriptive, diagnostic, predictive and prescriptive analysis to the data. At first, the results may be baffling or contradictory, but always keep digging.
Just be vigilant and look for these common errors:
- False positives that seem important but are actually coincidental.
- False negatives, which are important relationships that are hidden by dirty data or statistical noise.
- Lurking variables, where an apparent relationship is caused by something the data didn’t capture.
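A lurking variable can even reverse a conclusion outright. This toy sketch uses invented admission numbers to show Simpson’s paradox: within every department, group B is admitted at a higher rate than group A, yet the aggregate numbers say the opposite, because the lurking variable (which department people applied to) was left out:

```python
# (group, department): (applied, admitted) -- invented numbers for illustration.
applications = {
    ("A", "dept1"): (80, 60),   # 75% admitted
    ("A", "dept2"): (20, 5),    # 25% admitted
    ("B", "dept1"): (20, 16),   # 80% admitted
    ("B", "dept2"): (80, 24),   # 30% admitted
}

def rate(group, dept=None):
    """Admission rate for a group, optionally restricted to one department."""
    pairs = [v for (g, d), v in applications.items()
             if g == group and (dept is None or d == dept)]
    applied = sum(a for a, _ in pairs)
    admitted = sum(m for _, m in pairs)
    return admitted / applied

# Within each department, B beats A...
assert rate("B", "dept1") > rate("A", "dept1")
assert rate("B", "dept2") > rate("A", "dept2")
# ...but the aggregate makes A look better.
print(rate("A"), rate("B"))  # 0.65 0.4
```

The aggregate comparison is misleading because group A mostly applied to the easier department; controlling for department tells the true story.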
Interpret the Results
This stage is where a data analyst must practice careful judgment and has the most chance to be wrong. It’s up to an analyst to determine which models, statistics and relationships are actually important.
Then the data analyst must understand and explain what the models do and do not mean. For instance, political scientists and journalists often build models that use polls to predict a presidential election. In 2008 and 2012, those models correctly predicted the results. In 2016, the models showed lower levels of certainty, and the candidate they said was more likely to win did not win. Many people ignored that change in certainty and were shocked by the results, falling prey to confirmation bias: they saw only the data that supported their beliefs about who would win.
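A quick simulation makes the point about certainty concrete. If a model gives a candidate a 70 percent chance of winning (a hypothetical figure), the underdog still wins about three elections in ten:

```python
import random

random.seed(42)  # fixed seed so the result is reproducible

win_probability = 0.70   # hypothetical model output, not a real forecast
trials = 10_000

# Simulate many elections under the model and count how often the
# "less likely" candidate wins anyway.
upsets = sum(random.random() > win_probability for _ in range(trials))

print(f"Upset rate: {upsets / trials:.1%}")  # roughly 30%
```

A 70 percent forecast is a statement of uncertainty, not a prediction of certainty; treating it as the latter is exactly the interpretation error described above.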
Visualize the Data
Staring at equations and columns of numbers is not appealing to many people. That’s why a data analyst has to make the numbers “friendly” by transforming data into visuals like charts and graphs. Modern data visualization takes this a step further with digital graphics and dashboards of interrelated charts that people can explore online.
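Even the crudest chart beats a raw column of figures. This sketch renders made-up weekly step counts as a text bar chart; real work would use a plotting library, but the principle is the same:

```python
# Invented weekly step counts for illustration.
weekly_steps = {"Mon": 4200, "Tue": 5100, "Wed": 6800,
                "Thu": 3900, "Fri": 7200, "Sat": 8100, "Sun": 5600}

scale = 1000  # one '#' per thousand steps
rows = []
for day, count in weekly_steps.items():
    bar = "#" * round(count / scale)
    rows.append(f"{day} {count:>5} {bar}")

print("\n".join(rows))
```

At a glance the bars show that Saturday was the most active day and Thursday the least, a conclusion that takes real effort to pull from the numbers alone.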
Data Analysis Tools
While there are countless tools for each phase of data analysis, the most popular tools break down in the following way:
- SurveyMonkey: Do you need to collect data from your users or customers? There are many tools for online surveys, but SurveyMonkey is popular with analysts for its ease of use, features and capabilities. You can apply it to survey all users, only a random portion or a sample of the public.
- Data.world: There is a lot of data already out there, much more than any person can find just by searching the web. While data.world’s primary emphasis is allowing companies to host and analyze their own data in the cloud, its community portal has a rich set of datasets you can use. Other go-to data collections include: FRED for economic data, ESRI ArcGIS Online for geographic data and the federal government’s Data.gov.
- Google Analytics: Google produces a tool for tracking users online. If you have a website, you can use this free tool to measure virtually any aspect of user behavior. Competitors include Adobe Marketing Cloud, Open Web Analytics and Plausible Analytics.
- Microsoft Excel: The Swiss Army knife of data analysis, current versions of the Microsoft Excel spreadsheet can store up to 1 million rows of data. It also has basic tools for manipulating and visualizing data. Excel is available in desktop, mobile and online versions. Competitors include Google Sheets, Apple’s Numbers and Apache OpenOffice.
- PostgreSQL: One of the most popular of the traditional database systems, PostgreSQL can store and query gigabytes of information split into “tables” for each kind of data. It has the SQL language built in (see below), can be used locally or in the cloud, and can be integrated with virtually any programming language. Competitors include Microsoft SQL Server, Microsoft Access and MySQL.
- MongoDB: This is a popular “nonrelational” database. MongoDB combines data so that all the information related to a given entity, such as customers, is stored in a single collection of nested data. Competitors include Apache CouchDB, Amazon DynamoDB and Apache HBase.
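The relational-versus-nonrelational distinction in the last two entries can be sketched with plain Python containers standing in for the databases. The same customer data is modeled both ways (the customer and order values are invented):

```python
# Relational style (PostgreSQL-like): separate tables linked by a key.
customers_table = [{"customer_id": 1, "name": "Anne"}]
orders_table = [
    {"order_id": 101, "customer_id": 1, "item": "laptop"},
    {"order_id": 102, "customer_id": 1, "item": "mouse"},
]

# Nonrelational style (MongoDB-like): everything about the customer
# nested inside a single document.
customer_document = {
    "customer_id": 1,
    "name": "Anne",
    "orders": [{"order_id": 101, "item": "laptop"},
               {"order_id": 102, "item": "mouse"}],
}

# In the relational style, a "join" reassembles what the document holds directly.
joined = [o for o in orders_table
          if o["customer_id"] == customers_table[0]["customer_id"]]
print(len(joined), len(customer_document["orders"]))  # 2 2
```

Tables make it easy to query across all customers at once; nested documents make it fast to fetch everything about one customer in a single read.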
Of course, gathering and storing data aren’t enough. Data analysis involves tools to clean data, then transform it, summarize it and develop models from it.
- SQL: The go-to choice when your data gets too big or complex for Excel, SQL is a language for writing “queries” of a database to extract and summarize data matching a particular set of conditions. It is built into relational database programs, so you need a database for it to work. Each database system has its own version of SQL with varying levels of capability.
- R: R is the favored programming language of statisticians. It is free and has a large ecosystem of community-developed packages for specific analytical tasks. It especially excels in data manipulation, data visualization and calculations, while being less used for advanced techniques requiring heavy computation.
- Python: Python is the second-most-popular programming language in the world. It is used for everything from building websites to operating the International Space Station. In data analysis, Python excels at advanced techniques like web scraping (automatically gathering data from online sources), machine learning and natural language processing.
- Tableau: Analysts swear by this desktop program’s compatibility with nearly any data source, ability to generate complex graphics, and capability of publishing interactive dashboards that allow users to explore the data for themselves.
- Google Data Studio: Similar in some ways to Tableau, this is a web-based tool that focuses on ease of use over complex capabilities. It’s strongly integrated with other Google products, and many say it produces the best-looking results out of the box.
- Microsoft Power BI: No list of data visualization tools would be complete without Microsoft Power BI. It’s tightly linked with Microsoft’s desktop, database and cloud offerings, and focuses on allowing users to create their own dashboards and visualizations.
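To show what an SQL query looks like in practice, here is a minimal example run through Python’s built-in sqlite3 module. The table and values are invented; the SELECT/WHERE pattern carries over to PostgreSQL and the other systems above, though each has its own dialect:

```python
import sqlite3

# An in-memory database with a small, invented table of step counts.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE steps (day TEXT, count INTEGER)")
con.executemany("INSERT INTO steps VALUES (?, ?)",
                [("Mon", 4200), ("Tue", 5100), ("Wed", 6800)])

# Extract and summarize only the rows matching a condition.
(total,) = con.execute(
    "SELECT SUM(count) FROM steps WHERE count > 5000").fetchone()
print(total)  # 11900
con.close()
```

The query language stays the same whether the table holds three rows or three billion; that scalability is what makes SQL the workhorse beyond Excel.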
Left flowing, the “fire hose” of data McHenry describes quickly overwhelms most databases. Where can you store a clearinghouse of information? Here are some options:
- Oracle Database: Known as “Big Red,” Oracle is famed for its ability to scale to vast quantities of data. Oracle Database allows users to store and analyze big data using familiar database formats and tools like SQL.
- Amazon Redshift: Amazon Redshift is pitched as a more affordable alternative to Oracle Database. As part of Amazon Web Services, it integrates well with other AWS services, but it can be used only within the AWS cloud.
- Domo: Domo combines the capabilities of a data warehouse like Oracle or Amazon Redshift with a functionality similar to Microsoft Power BI. It is used by organizations that want to allow many employees to gain access to a data warehouse.
Example of Data Analysis at Work
Putting together all the pieces of the data analysis puzzle might seem complex, but the time and resources required are worth the gains, says Pentti Tofte, vice president and head of analytics at the property insurer FM Global.
FM’s goal is not just to set insurance rates, but also to help customers reduce them, Tofte says. His inspectors visit more than 100,000 properties every year and record more than 700 pieces of data. Combining that information with data related to risks like fires and hurricanes, FM can then provide recommendations to the companies it insures.
“We believe most loss is preventable,” Tofte says. “We use data to tell them what losses to expect where and which vulnerabilities to prioritize.”
How Does Data Analysis Relate to Other Data and Business Functions?
Data analysis sits on a continuum with related disciplines, three of the most common being data analytics, data science and data mining.
Data Analysis vs. Data Analytics
Some people use these terms interchangeably, and data analysis is often considered a subset of data analytics. Generally, data analytics takes a more forward-looking view, predicting future actions or results.
Data Analysis vs. Data Science
Data science takes analysis a step further by applying techniques from computer science to generate complex models that take into account large numbers of variables with complex (and sometimes poorly understood) interrelationships.
Data Analysis vs. Data Mining
Data mining goes even deeper by automating the process of discovery. Software is developed to find relationships and build models from extremely large datasets. Data mining is extremely powerful, but the resulting models require extensive evaluation to ensure they are valid.
How to Sharpen Your Data Analysis Skills
So you want to learn more about data analysis, but where to start? There is no right answer for everyone. And with such a large topic, don’t expect shortcuts. Here are a few places to get started.
If you’ve never taken a statistics class, it’s time to read The Cartoon Guide to Statistics. While it’s no replacement for a semester-long class, it’s more than enough to get you started.
Speaking of classes, there are some very good options for free online. Coursera, Udacity and Khan Academy offer relevant classes for free, although some features may require a paid upgrade. As you get more advanced, you can access a library of great tutorials at KDNuggets.
To get started right now, check out YouTube, where you will find a nearly never-ending collection of videos on data analysis. I highly recommend tuning in to The Ohio State University professor and Nobel Fellow Bear Braumoeller’s online lectures that address data literacy and visualization.