You probably can’t imagine your life without the Internet. And, you’re not alone.
Today, 59% of the global population is using it. Services like Google, YouTube, Facebook, and Amazon have made us the generation that produces the most data ever.
In fact, humans are producing 2.5 quintillion bytes of data every day. What is more, we’ve created 90% of the world’s data during the past two years. That’s astonishing!
Data contributes to easier communication between people. It helps us get a better understanding of the world around us.
Okay, so we have all these vast amounts of data now. What do we do with it? Data science is the answer. The data science industry has been booming during the past several years, and so has the data science job market.
Harvard Business Review called the data scientist role “the sexiest job of the 21st century,” predicting the sector’s flourishing. What is more, the US Bureau of Labour Statistics projects 11.5 million data science job openings by 2026.
You know how you were looking at hotels in Portugal, and then you started seeing Booking ads about accommodation in Portugal on every website you visited? Well, you’re seeing them thanks to data science.
Data science is complicated enough that one clear description is hard to provide. When data gets more complicated and dynamic, so does the concept of data science. We can refer to data science as the process of gathering, storing, interpreting, and using data to make strategic decisions.
Many enterprises that collect large amounts of data have hired data scientists to help them get a better understanding of it. Data scientists also enable these companies to provide their clients with a personalized experience and make smarter decisions regarding their offering. Data science lies at the foundation of every company that leverages artificial intelligence (AI) and machine learning algorithms.
In the simplest terms, data science can be described as giving value to raw data. In fact, data science is developing so rapidly, and has already demonstrated such an immense set of possibilities, that describing it well requires a broader perspective.
And while a precise definition is difficult to pin down, it’s very easy to see and sense the effects of data science all around us. Data science can lead to amazing new discoveries when applied to numerous fields.
In this course, we’re offering a comprehensive introduction to data science and its basic concepts. By the end of the course, you’ll be able to understand its role in interpreting data all around us.
LESSON 1: An introduction to data science
1.1 What is data science?
To most people, data science sounds like something very complicated. When they think of data science, they see a vast amount of numbers, which seems rather chaotic.
And they aren’t far from the truth. However, data science isn’t as complex as it sounds. To put it simply, data science represents a set of methods, algorithms, and machine learning practices for processing thousands of data points in different forms to derive valuable insights that contribute to making smarter decisions.
Even though we aren’t always aware of it, we leave data every day. Our calls, payments, emails, likes, clicks—they all represent data points collected by various types of software. Tech giants collect them so they can improve their services and predict our future moves.
Data can be used in countless ways. It can show information about a certain moment through a real-time dashboard, as is the case with energy consumption. For example, when the energy consumption data of a certain home appliance is different than usual, it could mean that this appliance isn’t working well. This means that data can help us identify anomalies.
Your social media data, on the other hand, can reflect your behaviour. The pages and products you look at on Instagram tell this social network what your interests are, which leads to a more personalized feed and targeted ads. Netflix also collects data on your preferences to show you content you’re more likely to watch.
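As a rough illustration of the appliance example above, the sketch below flags readings that sit far from the typical value. The readings, the function name `find_anomalies`, and the cut-off of 2 standard deviations are all assumptions made for this example; real anomaly detection usually needs more care.

```python
from statistics import mean, stdev

def find_anomalies(readings, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    avg = mean(readings)
    spread = stdev(readings)
    if spread == 0:
        return []  # all readings are identical, so nothing stands out
    return [(i, x) for i, x in enumerate(readings)
            if abs(x - avg) / spread > threshold]

# Hypothetical hourly energy use of a fridge in kWh (made-up numbers).
fridge_kwh = [0.11, 0.12, 0.10, 0.11, 0.13, 0.12, 0.11, 0.45, 0.12, 0.10]
anomalies = find_anomalies(fridge_kwh)
print(anomalies)
```

With these made-up readings, only the eighth value (0.45 kWh) is far enough from the rest to be reported.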
One very important function of data is that it can help us predict future events, like the size of a population or the increase in energy demand. To make results more precise, data scientists use mathematical formulas which uncover the level of probability for a certain event to happen.
In essence, data science is used to make decisions and forecast future events. Data scientists use predictive and prescriptive analytics and machine learning algorithms to do this. Their job is to find new perspectives on the same data.
As we’re collecting more data than ever before, data science is necessary in today’s world.
1.2 Why is data science important?
Raw data is of little use without experts who can apply cutting-edge technologies to turn it into actionable insights. More and more businesses are becoming aware of the strength of data.
This increases the importance of data scientists who understand how to derive actionable knowledge from terabytes of data points. Many experts even believe that data science can save the world. It’s already been used in actions to prevent deforestation, monitor biodiversity, and save the oceans.
In the business world, data helps entrepreneurs start companies, build better products, and bring them to the market faster. By leveraging available digital data, they can reach a significant number of users in no time.
The potential of an organization to succeed is now measured by how easily it can apply analytics to large, unstructured data sets from diverse channels to accelerate innovation. This puts data scientists in high demand, as people who are crucial to product success.
Let’s see some of the ways that companies can leverage data science in their favour:
Review the company’s performance. Assessing the performance of a company or its products is one of the most basic roles of data science. If there are goals to be tracked, data scientists work proactively to identify best performers, recognize the factors of metric changes, and create an analysis that visualizes the changes.
Design customer-centric products. A lot of organizations conduct surveys and experiments and design products based on customer feedback. Data scientists help in creating these experiments and defining data-informed theories. Moreover, they lead the development team through data insights to continuously improve the product.
Predict events and outcomes. Data scientists design machine learning models that get trained to predict future trends. These forecasts can help companies prepare for what’s coming.
Create product strategies. Thanks to data science, companies can perform an in-depth analysis of the customer journey to extract actionable information that can contribute to the overall product strategy. A data-driven strategy improves the chances of product success.
Identify opportunities. By working with the company’s existing analytics systems, data scientists question established processes and assumptions in order to develop additional methods and analytical algorithms. Their work allows them to constantly identify new success opportunities for their company.
Hire the right employees. Given the amount of information available on talent through social media, organizational directories, and career search platforms, data science experts can work their way through a vast number of data points to identify the applicants who best match the organization’s needs.
Prevent fraud and data breaches. Data scientists are trained to identify data that stands out from the norm. They build statistical, network, path, and big data methods for predictive fraud models and use them to create alerts that fire when unusual data is detected.
1.3 What does a data scientist do?
Data scientists draw on extensive expertise in particular technical fields to address complex data problems. They combine statistics, simulation, mathematics, computer science, artificial intelligence, and other disciplines.
To find answers and make choices important to the development and success of the business, data scientists take advantage of new technologies. During this process, they analyze raw data from both structured and unstructured sources, turning it into relevant insights.
The data-related task isn’t always clearly set. The data scientist has to turn the problem into a concrete and observable scenario, find out how to solve it, and present the solution to their managers.
These are the steps they go through:
- Describe the problem. What is it you’ve got to deliver? How is that meant to improve your business? How do you turn the ambiguous problem into a straightforward and understandable challenge? These are the questions data scientists have to answer.
- Collect the data related to the problem. Next, they find the sources that will provide the data they need. They also need to decide what kind of data is important for the particular problem and estimate the time and effort they need to gather it.
- Process the data. Raw data can be very chaotic and full of ambiguities. This is when the data scientist cleans the data and turns it into understandable information.
- Analyze the data. In this step, the data scientist has to look for patterns in the data and identify the most relevant ones.
- Make forecasts. To do this, the data scientist leverages machine learning algorithms and statistical models. They analyze the patterns and generate predictions based on them.
- Present the results. Finally, the data scientist presents their findings in front of their stakeholders in a language they can understand.
This isn’t a standard process all companies use. Some companies allow data scientists to look for problems independently. This requires a lot of experience in the company and a lot of time spent analyzing company data.
1.4 Data science vs. data analysis
Don’t mix up data science and data analysis. Although many people use the terms interchangeably, data science and big data analytics are distinct domains, with scope being the main distinction. Data science is an umbrella term for a group of fields that are used to mine large data sets.
Data analytics is a more focused version of this and can be considered part of the broader process. Data analysis is committed to the discovery of actionable knowledge that, based on current questions, can be implemented immediately.
The question of exploration is another crucial contrast between the two areas. Data science isn’t concerned with answering specific questions. Instead, it parses large data sets, often in unstructured forms, to reveal insights.
Data analysis performs best in more specific cases, with questions in mind that require answers based on current data. Data science provides wider perspectives that focus on which questions should be posed, whereas big data analytics focuses on finding answers to questions that are already being asked.
Being a data scientist is much broader than being a data analyst, with some overlapping concepts. To find useful insights, the data analyst typically analyzes the history of the data. On the other hand, depending on the gathered data, the data scientist analyzes the data, detects patterns, and uses machine learning methods to forecast future events.
1.5 Data science vs. business intelligence (BI)
Data science often gets mixed up with business intelligence (BI). Although they both revolve around data, the processes and the purposes of the two concepts have crucial differences.
BI essentially analyzes past data to find insights and identify market trends. It helps companies gather data from external and internal sources, prepare it, run queries, and create dashboards to address questions such as quarterly sales reviews or business problems. BI can assess the impact of certain events in the near future. The goals include gaining a deeper awareness of the market, identifying new market opportunities, improving business procedures, and gaining competitive advantages.
Data science looks further ahead. It’s an exploratory way to evaluate historical or current data and to anticipate future outcomes in order to make better choices. It asks open-ended questions about the occurrence of events.
The differences also lie in the tools and resources. While BI works with structured data, typically queried with SQL, data science processes both structured and unstructured data sources. To process structured data sources, BI uses statistics and visualization, while data science requires more complex methods like machine learning, NLP, graph analysis, etc.
When it comes to tools, BI leverages Microsoft BI, Pentaho, QlikView, and R. On the other hand, data science uses BigML, RapidMiner, Weka, R, and others.
LESSON 2: Key components of data science
2.1 Data preparation
Data preparation is the process of cleaning and converting raw data before processing and analysis. This is a vital phase that sometimes includes reformatting data or merging data sets to enrich data.
For data scientists or analysts, data preparation is often a tedious task, but it is essential for putting data in context and converting it into information, and for removing problems that could arise from low data quality.
One of the key objectives of data preparation is to ensure the quality and consistency of raw data for further processing and analysis so that the findings of BI and analytics applications are accurate. In general, raw data is full of incomplete data points, inconsistencies, or other mistakes.
Furthermore, different data sets often come in different formats that need to be reconciled. A major part of the data preparation process is correcting data mistakes, checking data integrity, and combining data sets.
Data preparation also means identifying the appropriate data to use during the analysis, to ensure that it contains the information that data scientists or business users are searching for. To make it more insightful and usable, the data should also be enriched and optimized. Data scientists do this by combining internal and external data sets, generating new data fields, removing missing values, and resolving imbalanced data sets that could distort the findings of analytics.
The data preparation process consists of several different steps. The steps are different for different data analysts and scientists, but the process usually includes the following tasks:
- Collect data. Relevant data is gathered from operational systems, data warehouses, and other data repositories. At this point, the data professionals collecting the data have to check that the information is a good match for the purposes of the planned applications.
- Explore data. To truly understand what the collected data contains and what needs to be done to prepare it for the planned purposes, data professionals have to explore it. Data profiling helps to identify patterns, inconsistencies, anomalies, missing data, and other attributes and issues in the data sets so that problems can be addressed.
- Cleanse the collected data. In this stage, data professionals fix the identified errors to build complete and reliable data sets that are ready to be processed and analyzed. Defective data is, for instance, discarded or fixed, missing values are filled in, and conflicting records are reconciled.
- Structure data. After cleansing it, it’s important to structure and organize the data into a cohesive format that meets the specifications of the proposed analytics applications.
- Transform and enrich data. Structured data needs to be transformed to make it coherent and convert it into functional knowledge. To deliver the necessary information, it’s important to enrich the data.
- Validate data. This is a step where the data scientist runs automated processes to check the accuracy and consistency of the data.
- Store data. Once prepared, the data can be stored or passed to a third-party application, clearing the way for processing and analysis.
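The cleansing, enriching, and validating steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the records, the field names, the 0 to 120 age range, and the helper functions `cleanse`, `fill_missing_ages`, and `validate` are all hypothetical and exist only for this example.

```python
from statistics import median

# Hypothetical raw records; field names and values are made up for this example.
raw = [
    {"id": "1", "city": " Lisbon ", "age": "34"},
    {"id": "2", "city": "porto", "age": ""},        # missing age
    {"id": "2", "city": "porto", "age": ""},        # duplicate record
    {"id": "3", "city": "LISBON", "age": "-5"},     # impossible age
    {"id": "4", "city": "Faro", "age": "41"},
]

def cleanse(records):
    """Drop repeated ids, normalise text, and turn invalid ages into None."""
    seen, cleaned = set(), []
    for rec in records:
        if rec["id"] in seen:
            continue
        seen.add(rec["id"])
        age = int(rec["age"]) if rec["age"].lstrip("-").isdigit() else None
        if age is not None and not 0 <= age <= 120:
            age = None
        cleaned.append({"id": int(rec["id"]),
                        "city": rec["city"].strip().title(),
                        "age": age})
    return cleaned

def fill_missing_ages(records):
    """Fill missing ages with the median of the known ages."""
    known = [r["age"] for r in records if r["age"] is not None]
    fallback = median(known)
    return [{**r, "age": fallback if r["age"] is None else r["age"]} for r in records]

def validate(records):
    """Check that ids are unique and every age is plausible."""
    ids = [r["id"] for r in records]
    return len(ids) == len(set(ids)) and all(0 <= r["age"] <= 120 for r in records)

prepared = fill_missing_ages(cleanse(raw))
print(prepared)
print("valid:", validate(prepared))
```

Filling gaps with the median is only one possible choice; depending on the analysis, dropping incomplete records may be more appropriate.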
2.2 Statistics
Statistics is at the heart of data science, standing behind the machine learning algorithms that data scientists use. Statistics is a mathematical discipline concerned with gathering, analyzing, interpreting, and presenting data. In the real world, statistics is used to process complex problems so that data scientists can search for patterns and changes in data. In plain terms, statistics can be used to extract concrete knowledge from data by running statistical computations on it.
To analyze raw data, construct a statistical model, and infer or forecast the outcome, data scientists apply multiple statistical functions, concepts, and algorithms. The world of statistics influences all aspects of life, including the financial market, life sciences, weather, trade, insurance, and many others.
Here are some fundamental statistics terms:
- The statistical population is a set of related objects or events that are of concern to some topic or experiment. It can represent a collection of actual objects or a possible and theoretically infinite group of objects that data gets collected from.
- A sample is a collection of individuals or objects that a given method extracts or selects from a statistical population.
- A variable is a measured attribute of the individual or the object.
- A statistical model is a mathematical model that encompasses a series of statistical assumptions about how the sample data is generated. It basically represents the process of data generation.
Data analysis can be:
- Quantitative, when data is expressed in numbers and presented in charts that showcase patterns and changes.
- Qualitative, when text, images, or sound are used instead of numbers.
The two main statistical methods are:
- Descriptive statistics, which uses measures like the mean or standard deviation to summarize sample data. It basically provides a description of the data through elements like numbers, charts, and tables.
- Inferential statistics, which draws conclusions from data that is subject to random variation. Based on a sample of data collected from the population in question, inferential statistics makes inferences and projections about the population.
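To make the difference between the two methods concrete, here is a small sketch that describes a sample and then uses it to infer something about the whole population. The population is simulated (heights drawn from a normal distribution), and the 95% interval uses the normal approximation with z = 1.96; both are assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights in cm of 10,000 people (simulated, not real data).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# In practice we only get to observe a random sample, here 100 people.
sample = rng.sample(population, 100)

# Descriptive statistics summarise the sample itself.
sample_mean = mean(sample)
sample_sd = stdev(sample)

# Inferential statistics: estimate the population mean with an approximate
# 95% confidence interval, using the normal approximation (z = 1.96).
standard_error = sample_sd / len(sample) ** 0.5
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"sample mean: {sample_mean:.1f} cm")
print(f"approx. 95% CI for the population mean: {ci_low:.1f} to {ci_high:.1f} cm")
```

The interval expresses the uncertainty that comes from seeing only 100 of the 10,000 values.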
Statistical features include the organization of the data, finding minimum and maximum values, finding the median value, and the quartiles’ description. A quartile is a type of quantile that divides the number of data points into four quarters or parts that are approximately equal. Mean, mode, and bias are also considered as statistical features.
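The features listed above can be computed with Python's built-in `statistics` module. The visit counts below are made-up numbers used only to show the calls.

```python
from statistics import mean, median, mode, quantiles, stdev

# Hypothetical sample: daily website visits over two weeks (made-up numbers).
visits = [120, 135, 128, 150, 142, 128, 160, 155, 128, 170, 149, 138, 145, 152]

ordered = sorted(visits)                 # organise the data
lowest, highest = ordered[0], ordered[-1]
q1, q2, q3 = quantiles(visits, n=4)      # the three cut points between the four quarters

summary = {
    "min": lowest,
    "max": highest,
    "mean": mean(visits),
    "median": median(visits),
    "mode": mode(visits),
    "stdev": stdev(visits),
    "quartiles": (q1, q2, q3),
}
for name, value in summary.items():
    print(name, value)
```

Note that the second quartile is the median, which is why both values agree in the output.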
Probability theory is also a very important data science concept. It deals with measuring the likelihood that some event will happen. A random experiment is a physical situation whose outcome cannot be predicted until it is observed. The most common example is tossing a coin.
Probability is a number between zero and one that measures the likelihood that a given event will occur. The closer it is to one, the more likely the event is to occur. As landing on heads or tails is equally likely, the probability of either outcome when tossing a fair coin is 0.5.
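A quick simulation shows how the observed share of heads approaches 0.5 as the number of tosses grows. The function name `estimate_heads_probability` and the fixed seed are choices made for this sketch.

```python
import random

def estimate_heads_probability(tosses, seed=0):
    """Toss a simulated fair coin `tosses` times and return the share of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

for n in (10, 1_000, 100_000):
    print(n, estimate_heads_probability(n))
```

With only 10 tosses the estimate can be well off; with 100,000 it should land very close to 0.5.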
A probability distribution consists of all possible outcomes of a random variable and the associated probability values between zero and one. To measure the likelihood of obtaining particular values or events, data scientists use probability distributions. A distribution is commonly summarized by its expected value, variance, skewness, and kurtosis.
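As a small worked example, the sketch below writes down the distribution of a fair six-sided die and computes the four summary measures from it. The helper `moment_about` is a name invented here, and kurtosis is reported in its plain (non-excess) form.

```python
from fractions import Fraction

# Probability distribution of a fair six-sided die: each outcome has probability 1/6.
distribution = {face: Fraction(1, 6) for face in range(1, 7)}
assert sum(distribution.values()) == 1  # probabilities must add up to one

def moment_about(dist, center, power):
    """Probability-weighted average of (outcome - center) ** power."""
    return sum(p * (x - center) ** power for x, p in dist.items())

expected = moment_about(distribution, 0, 1)            # expected value (mean)
variance = moment_about(distribution, expected, 2)     # spread around the mean
sd = float(variance) ** 0.5
skewness = float(moment_about(distribution, expected, 3)) / sd ** 3   # asymmetry
kurtosis = float(moment_about(distribution, expected, 4)) / sd ** 4   # weight of the tails

print(expected, variance, skewness, kurtosis)
```

The expected value is 7/2 and the skewness is zero, since the die's outcomes are symmetric around 3.5.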
Dimensionality reduction is the method of reducing the number of dimensions (features) in your data set. The purpose of this is to solve problems that occur with high-dimensional data sets and do not appear in lower dimensions.
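One common technique for this is principal component analysis (PCA), which the text above does not name; it is used here as one possible illustration. The sketch handles only the simplest case, reducing two features to one, where the main axis can be found with a closed-form angle. The data and the function name `reduce_to_one_dimension` are hypothetical.

```python
import math
from statistics import mean

def reduce_to_one_dimension(points):
    """Project 2-D points onto their direction of greatest variance.

    This is PCA for the special case of two input dimensions, where the
    main axis of the covariance matrix has a closed-form angle.
    """
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(angle), math.sin(angle))
    # Each point is replaced by a single number: its position along the main axis.
    projected = [(x - mx) * direction[0] + (y - my) * direction[1] for x, y in points]
    return direction, projected

# Hypothetical data: height (cm) and arm span (cm), two strongly related features.
people = [(160, 158), (165, 166), (170, 169), (175, 177), (180, 179), (185, 186)]
direction, one_d = reduce_to_one_dimension(people)
print(direction)
print([round(v, 1) for v in one_d])
```

Because the two features move together, one number per person keeps most of the information that the original two held.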
2.3 Data visualization
Data visualization helps people understand large amounts of data through graphics that are easy to digest. It makes details easier to decode for people who aren’t so familiar with data science. Moreover, data visualization can help data scientists organize their data and generate new hypotheses for their next actions.
Here are some of the most commonly used types of visualizations that can help data scientists present their data more understandably:
Bar and pie charts
Bar and pie charts are used to present data that has a fixed number of values, like yes and no, low and high, etc.
Sometimes, the choice between a pie and a bar chart can be tricky. A pie chart should only be used when the individual components add up to a meaningful whole, as it is designed to visualize how each part relates to that whole. Bar charts, meanwhile, suit a wider variety of data and are not limited to breaking down a whole into parts.
Too many categories can overwhelm the visualization. In that case, consider picking the highest values and visualizing only those.
Histograms
Histograms are used to present numerical variables of interest, as well as their frequencies. In a histogram, the variable of interest is binned into ranges on the x-axis, and the frequency of the values in each bin is displayed on the y-axis. Here’s a great histogram example from WallStreetMojo:
Mr A wants to make an investment in the stock market. He has shortlisted the below stocks and wants to know the frequency of the prices.
Solution: We have created a histogram using 5 bins with 5 different frequencies, as seen in the chart below. The y-axis shows the number of stocks falling in each category. The x-axis shows the ranges of stock prices. For example, the first bin covers the range 100 to 300, and we can note from the table and the graph below that the count for that category is 7.
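The binning behind a histogram is simple to reproduce. The sketch below uses the same bin layout as the example (5 bins of width 200, starting at 100), but the prices are hypothetical and are not the figures from the quoted example.

```python
def histogram(values, start, bin_width, bin_count):
    """Count how many values fall into each equal-width bin [low, high)."""
    counts = [0] * bin_count
    for v in values:
        index = int((v - start) // bin_width)
        if 0 <= index < bin_count:
            counts[index] += 1
    return [(start + i * bin_width, start + (i + 1) * bin_width, c)
            for i, c in enumerate(counts)]

# Hypothetical stock prices; these are NOT the figures from the quoted example.
prices = [120, 150, 210, 280, 310, 350, 420, 480, 530, 610, 640, 720, 880, 950, 1040]

for low, high, count in histogram(prices, start=100, bin_width=200, bin_count=5):
    print(f"{low:>4} to {high:<4} | {'#' * count} ({count})")
```

Each printed row is one bar of the histogram: the price range on the left and the number of stocks in that range on the right.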
Line and scatter plots
A line plot is a graph that uses a number line to represent results. To start creating the line plot, first construct a number line that contains all the values in the data set. Then, on the number line, put an X (or dot) above each specific value. This is the most basic type of chart that’s used in many disciplines.
Line plots are similar to scatter plots, where points are represented with dots or circles instead of with lines. They use Cartesian coordinates to display a set of data, usually through two variables. Using colours can enable you to add one or two more dimensions to the graph. The data is shown as a set of dots, each with the value of one variable deciding the horizontal axis position and the value of the other variable determining the vertical axis position.
Time series
A time series is essentially a set of time-ordered data points. Time is usually the independent variable, and the purpose is typically to make a prediction for the future. Time plots have a time range set on the x-axis, where every dot represents part of a line. They are used to analyze trends and identify rises and falls in the data over time. The most common time series plot example is the representation of stock prices.
The Amazon stock price, a chart generated on macrotrends.net
Heat maps
A heat map is a form of data visualization where the individual values in a matrix are displayed by colour variations. In 1991, software designer Cormac Kinney coined the term “heat map” to describe a 2D display representing real-time stock market statistics, even though similar visualizations have existed for over a century. Heat maps are useful for showing variations across several variables and revealing association patterns.
Maps
Maps are a very popular data visualization type we use every day. If your data contains longitude and latitude information, or other ways to arrange data geographically (zip codes, area codes, county data, airport data, etc.), maps can add significant meaning to your visualizations. There are many different types of maps you can use in your visualizations.
2.4 Artificial intelligence (AI)
AI represents the ability of a machine to perform tasks like visual perception, comprehension of speech, decision-making, and language translation.
The aim of AI is to give computers human-like intelligence by using some of its subcategories, like Natural Language Processing (NLP), machine learning, deep learning, etc. Data scientists use AI as an important component that helps them generate insights from data. It helps them solve challenges and make the most of the available data.
Though data scientists can have their own approaches to how this should be carried out, artificial intelligence is a central factor of this process in one manner or another. By simplifying processes, it can transform the way data is extracted and used in an organization.
In this context, it’s important to note that the fear of AI taking over the jobs of data scientists is unnecessary. AI is here to improve human skills. Instead of replacing them, it enables data scientists to perform jobs better. Among other aspects, it increases the efficiency of analytics tools, breaks down economic barriers, and strengthens knowledge.
2.5 Machine learning
Machine learning is a subset of AI and the most common association with data science. It’s what everyone thinks of when data science is mentioned. Machine learning is described as the method of using machines to better comprehend a system or process and to reproduce or improve them.
Machines can process data to gain a certain understanding of the underlying structure that produced it. Or, they can process data to generate new structures that can understand the particular data.
Machine learning processes are based on algorithms. Algorithms represent sets of guidelines for a machine to execute some particular operation, which is typically likened to a recipe. With algorithms, you can create many different structures to enable computers to execute different tasks.
Here are the fundamental categories of machine learning:
Supervised learning uses example input-output pairs to map an input to an output. It infers a function from labelled training data consisting of a set of training examples. In supervised learning, each example is a pair that contains an input object and a target output value.
The training data is analyzed by a supervised learning algorithm, and an inferred function is generated, which can be used for mapping new examples. In an optimum scenario, the algorithm correctly decides the class labels for unseen cases. This requires the learning algorithm to generalize sensibly from the training data to unseen situations.
Supervised learning is about predicting things that you’ve witnessed before. You examine what the outcome of a process was in the past and construct a model that tries to draw out what matters, so that it can build forecasts for the next time the process occurs.
For example, supervised learning can be useful when predicting who is going to win a sports event. In this case, the machine will use the results from past similar events. Or, social networks can use this method to decide which ad or content type you are going to click next.
Before we had machine learning, people were forecasting sports results manually, based on the past results of the teams or players playing. However, machine learning can now generate much more accurate predictions by processing larger amounts of data.
Another example of supervised learning is predicting real estate prices. For this, you need data about the estates, including features, square footage, number of rooms, etc. The labels would be the existing prices of the estates. By processing this type of data for a large number of estates, a supervised machine learning model can predict the price of a new real estate.
Supervised learning is split into regression and classification.
Regression uses labelled datasets to learn from them, after which it can forecast the continuous-valued output of the new data presented to the algorithm. Scientists use it when the output is numerical. These are the most commonly used regression algorithms:
- Linear regression, where the relationship between the input and the output is assumed to be linear. The input is an independent variable, while the output is a dependent variable.
- Logistic regression forecasts discrete values for the dependent variable by mapping the unseen data to the logistic function built into it. With an output between 0 and 1, the algorithm estimates the probability that the new data belongs to a given class.
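Tying this back to the real estate example, here is a minimal sketch of fitting a straight line to labelled data and using it to predict an unseen case, followed by the logistic function that squeezes any number into the range 0 to 1. The areas and prices are made-up numbers, and `fit_line` and `logistic` are names chosen for this example.

```python
import math
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def logistic(z):
    """Logistic (sigmoid) function: maps any number to a value between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Hypothetical training data: floor area in square metres and sale price in thousands.
area = [50, 60, 80, 100, 120]
price = [150, 180, 240, 300, 360]   # labels: made-up prices, exactly 3 per square metre

slope, intercept = fit_line(area, price)
print(slope, intercept)             # the learned relationship
print(slope * 90 + intercept)       # predicted price for an unseen 90 m² home
print(logistic(-4), logistic(0), logistic(4))
```

Real data is never this tidy, so the fitted line would normally only approximate the prices rather than match them exactly.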
Classification is the type of learning where the algorithm assigns new data to one of the classes in the dataset, often one of two. Unlike regression, where the result is a continuous number (or, for logistic regression, a probability between 0 and 1), here the result is a discrete class label such as 1 or 0. In real life, this kind of learning often predicts whether something will or will not happen, so the output is yes or no. Here are some of the most popular classification algorithms:
- Decision Tree analyzes the features of the data set to decide which one carries the most valuable information, then splits the data into branches, starting from the root, which holds the whole dataset.
- Naive Bayes classifies data by treating the dataset features as independent of each other. It is usually used in cases of enormous datasets.
- Support Vector Machines are based on Vapnik’s statistical learning theory and use the kernel method to separate the two classes.
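To show how a decision tree might decide which feature has the most valuable information, the sketch below computes information gain, one widely used criterion (the text above does not say which criterion is meant, so this is an assumption). The loan data is made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target):
    """Reduction in label entropy after splitting the rows on one feature."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

# Hypothetical loan applications (made-up data): was the loan repaid?
rows = [
    {"employed": "yes", "owns_home": "yes", "repaid": "yes"},
    {"employed": "yes", "owns_home": "no",  "repaid": "yes"},
    {"employed": "yes", "owns_home": "yes", "repaid": "yes"},
    {"employed": "no",  "owns_home": "yes", "repaid": "no"},
    {"employed": "no",  "owns_home": "no",  "repaid": "no"},
    {"employed": "no",  "owns_home": "yes", "repaid": "no"},
]

gains = {f: information_gain(rows, f, "repaid") for f in ("employed", "owns_home")}
best_feature = max(gains, key=gains.get)
print(gains)
print("split first on:", best_feature)
```

In this toy data, employment status separates the two classes perfectly, so a tree would split on it first.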
Unsupervised learning is a form of machine learning that leverages data sets without previously established labels to search for undetected patterns with minimal human supervision. It allows the modelling of probability densities over inputs, unlike supervised learning, which typically uses human-labelled data.
Unsupervised learning leaves the target outcome behind and works only with the existing data. This method of machine learning is less concerned with making predictions than with recognizing and defining connections or correlations that might exist within the data. It determines relationships within the data that’s already available.
For example, with unsupervised learning, you can separate your target customers into segments. You can perform this by using clustering algorithms, an unsupervised learning technique we’ll explain further. You basically segment the data points in such a way that every data point fits into a group based on a certain feature identical to other data points in the same group.
Another example of unsupervised learning we encounter every day is the suggested friends feature on Facebook. They are shown to you based on the number of common friends you have, with your profile being part of a cluster with similar sets of mutual friends.
Clustering and association are the two types of unsupervised learning.
Clustering identifies patterns in the datasets, splitting them into clusters based on various features. It can be:
- Hierarchical, where datasets are clustered according to the similarity between the data points.
- K-Means, where the algorithm calculates the centroid of each cluster to create clusters that consist of data points that are as homogeneous as possible. The distance between the centroid and the data points should be minimal, resulting in clusters that can be labelled, as they contain very similar data points.
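- K-NN (k-nearest neighbours), which is often listed alongside clustering methods although it is strictly a supervised classification algorithm. It is only activated when a new data point comes in, classifying it based on the labelled data points it has stored. This algorithm isn’t well suited to large datasets, where a lot of new data points come in.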
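A bare-bones version of K-Means fits in a few lines. This sketch assumes two-dimensional points, a fixed number of iterations, and randomly chosen starting centroids; the customer data and the function name `k_means` are hypothetical.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # start from k randomly chosen points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster's points (keep the old one if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (age, monthly spend). Made-up numbers with two obvious groups.
customers = [(22, 30), (25, 35), (24, 28), (51, 210), (55, 190), (53, 220)]
centroids, clusters = k_means(customers, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

In practice, features on very different scales (like age and spend here) are usually rescaled first so that one of them does not dominate the distance.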
Association is an unsupervised learning type that finds and links the relations between one data item and another, such as items that tend to appear together. These are the two most commonly used association algorithms:
- Apriori identifies the dependency of one data point on another by finding sets of items that frequently occur together, intending to determine what will happen to it if something changes in the data point that it depends on. For example, changes in the price of an item can influence the price of its complementary product.
- Frequent Pattern (FP) Growth is an algorithm that counts recurring patterns, records the counts in a table, then takes the most frequent item and places it at the base of a tree. Then, other items are added based on the determined support. The root of the tree should indicate the item that determines the association.
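The core idea behind Apriori can be sketched for pairs of items: count how often items appear together (support) and how often one item implies another (confidence). This is a simplified illustration limited to pairs, not a full implementation, and the baskets are made-up data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (made-up data).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (share of baskets containing both) meets the minimum."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    # Apriori idea: a pair can only be frequent if both of its items are frequent.
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter(
        pair
        for t in transactions
        for pair in combinations(sorted(t & frequent_items), 2)
    )
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

def confidence(transactions, if_item, then_item):
    """Share of baskets containing `if_item` that also contain `then_item`."""
    with_if = [t for t in transactions if if_item in t]
    return sum(then_item in t for t in with_if) / len(with_if)

pairs = frequent_pairs(baskets, min_support=0.4)
print(pairs)
print(confidence(baskets, "butter", "bread"))
```

Here bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets with butter also contain bread.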
Reinforcement learning is the machine learning branch that helps algorithms learn from the consequences of their own choices. It addresses a special kind of problem where decision-making is sequential and the goal is long-term.
In an unpredictable, possibly complex environment, the agent learns to accomplish a goal. This looks a lot like a game where the machine uses trial and error to get to the solution. For the actions it takes, the agent earns either rewards or penalties, which steer the system towards what the data scientist or the developer wants. Its aim is to maximize the total reward it collects over time.
While supervised and unsupervised learning generally rely on static data and return static results, reinforcement learning involves an agent that interacts with a changing environment. Game-playing AI is the most popular illustration. IBM’s Deep Blue, which won its first game against Garry Kasparov in 1996 and beat him in a full match in 1997, is often mentioned here, although it relied mainly on brute-force search rather than on learning. A clearer example is DeepMind’s AlphaGo, which used reinforcement learning to find out which moves are good and which are bad, playing games against itself and getting better after each one.
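The trial-and-error loop can be sketched with Q-learning, one common reinforcement learning algorithm. Everything below is a toy assumption: the "environment" is a made-up corridor of five cells, the agent starts in cell 0, and it is rewarded only when it reaches cell 4.

```python
import random

N_STATES = 5         # corridor cells 0..4; reaching cell 4 ends an episode
ACTIONS = (-1, +1)   # action 0 = step left, action 1 = step right
GOAL = N_STATES - 1

def step(state, action):
    """Move inside the corridor and return (next_state, reward, done)."""
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # estimated value of each action per cell
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore at random sometimes (and on ties), otherwise take the best-known action.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update: nudge the estimate towards reward + discounted future value.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q

q_table = train()
policy = ["left" if left > right else "right" for left, right in q_table[:GOAL]]
print(policy)
```

Nobody tells the agent that walking right is the correct strategy. It discovers this from the rewards alone, which is the core idea of reinforcement learning.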
To get a better overview of reinforcement learning, check out this article.
2.6 Deep learning
Deep learning is a type of machine learning that uses Artificial Neural Networks to simulate how a human brain operates.
Learning from data over time is built into both machine learning and its subcategory, deep learning. Deep learning is one of the strongest links between AI and data science, although it is not the only one.
The self-driving car is one of the most popular examples. Deep learning teaches the car’s computer to behave as a human driver would. For example, we can easily identify and understand the meaning of a sign on the road. However, a machine needs to learn a lot to gain the ability to do this. Deep learning is basically that process of learning from a large number of examples.
By continuously processing data with a given logical structure, deep learning algorithms aim to draw conclusions similar to those a human would. Deep learning leverages a multi-layered system of algorithms called neural networks to do this.
Figure: A neural network
The architecture of the neural network is inspired by the human brain’s configuration. In our process of thinking, we use our brains to recognize patterns and distinguish multiple information categories.
Deep learning trains neural networks to perform the same data processes. It is also possible to regard the individual layers of neural networks as a sort of filter that works from coarse to fine, raising the chance of identifying and generating the right outcome. Just as the human brain compares new information to what it already knows, so do neural networks.
Artificial neural networks are making a real revolution in how we use technology. They are basically artificial models of our brain’s biological neurons, connected in a graph of numerical values. The numerical values are called weights and they define the strength of the relations between the neurons.
As a neural network learns, it adjusts the weights, making some connections stronger and others weaker. When there’s a specific task to be performed, we need a particular set of weights to enable the neural network to perform the task. We can’t know these values up front, so the neural network has to learn them.
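To see what "learning the weights" means, here is a minimal sketch of a single artificial neuron that learns the logical OR function. A real deep learning model stacks many layers of such neurons, so treat this as a simplified illustration. The learning rate and the number of training rounds are arbitrary choices.

```python
import math

def sigmoid(z):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))

# Truth table for logical OR: (inputs, expected output).
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights = [0.0, 0.0]  # unknown up front, so we start from zero
bias = 0.0
learning_rate = 0.5   # arbitrary choice for this sketch

for _ in range(2000):
    for (x1, x2), target in samples:
        output = sigmoid(weights[0] * x1 + weights[1] * x2 + bias)
        error = target - output
        # Strengthen or weaken each connection in proportion to its share of the error.
        weights[0] += learning_rate * error * x1
        weights[1] += learning_rate * error * x2
        bias += learning_rate * error

predictions = [
    round(sigmoid(weights[0] * x1 + weights[1] * x2 + bias))
    for (x1, x2), _ in samples
]
print(weights, bias)
print(predictions)
```

The neuron starts with weights of zero and ends up with values that reproduce the OR table, without anyone writing the rule by hand.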
Thanks to deep learning, Netflix knows what movies to show us next based on our preferences. Alexa and Siri are also something we wouldn’t have without deep learning. Google Translate also offers us more precise translations, as Google switched to neural networks a few years ago.
To find out more about the architecture of artificial neural networks and how they function, check out this article.
2.7 Descriptive, predictive, and prescriptive analytics
Descriptive analytics: What has happened?
Descriptive analytics “describe” raw data or summarize it to turn it into something that people can understand. This type of analytics explains what has happened in the past, where the past means any point in time at which an event took place, whether it was minutes or years ago.
Descriptive analytics are helpful because they enable us to learn from previous behaviours and to consider how future performance may be affected.
Most of the basic statistics we learned at school and encounter every day are descriptive analytics. The underlying data is typically a count or an aggregate of a filtered data column to which simple math is applied. Descriptive analytics are useful for displaying values such as total stock on hand, average spend per customer, and year-over-year change in revenue.
Reports that offer historical insights into the development, financials, processes, revenue, financing, inventory and consumers of the business are typical examples of descriptive analytics.
You can use descriptive analytics to outline and explain various aspects of past events that concern your company.
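As a small illustration, the sketch below computes two of the values mentioned above, average spend per customer and year-over-year revenue change, from a handful of made-up order records.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical order records: (customer, year, amount spent)
orders = [
    ("alice", 2022, 120.0),
    ("bob", 2022, 80.0),
    ("alice", 2023, 150.0),
    ("carol", 2023, 90.0),
    ("bob", 2023, 60.0),
]

spend_per_customer = defaultdict(float)
revenue_per_year = defaultdict(float)
for customer, year, amount in orders:
    spend_per_customer[customer] += amount
    revenue_per_year[year] += amount

# Simple aggregates that describe what has already happened.
average_spend = mean(spend_per_customer.values())
yearly_change = (revenue_per_year[2023] - revenue_per_year[2022]) / revenue_per_year[2022]

print(f"Average spend per customer: {average_spend:.2f}")
print(f"Revenue change 2022 to 2023: {yearly_change:.0%}")
```

Nothing here predicts anything. The code only sums and averages past records, which is what makes it descriptive.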
Predictive analytics: What might happen?
Predictive analytics focus on the future. They deliver actionable data-based insights to enterprises, estimating how likely a possible result is. However, remember that no formula or algorithm can predict what’s going to happen with full certainty. These numbers only show what could happen in the future, generated with methods based on probability.
Predictive analytics apply mathematical algorithms to historical data to generate the best estimates for the future. This type of analytics can be used across many areas of a company, from predicting consumer activity and purchase habits to detecting changes in sales operations and forecasting supply chain demand.
For example, banks use predictive analytics to estimate whether an individual is likely to make loan payments regularly and to decide whether to grant them a loan or not.
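One of the simplest predictive techniques is fitting a straight trend line to historical data and extending it. The sketch below does this with ordinary least squares on made-up monthly sales figures. Real forecasting models are usually more elaborate, and the result is an estimate rather than a certainty.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical sales for months 1 to 4.
months = [1, 2, 3, 4]
sales = [102.0, 108.0, 121.0, 129.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept  # estimate for month 5
print(f"Trend: {slope:.1f} per month, forecast for month 5: {forecast:.1f}")
```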
Prescriptive analytics: What should we do?
Prescriptive analytics help data scientists generate a range of possible actions towards a solution. These analytics basically provide advice. They try to measure the impact of hypothetical decisions so that they can advise on the likely consequences before the decisions are actually taken.
They effectively forecast several futures and allow businesses to assess a variety of potential outcomes based on the actions they might take. A variety of approaches and methods such as business rules, algorithms, machine learning, and statistical modelling procedures are used in prescriptive analytics. These approaches are applied to various types of data, like historical data, real-time streams, and big data.
Larger businesses use prescriptive analytics to optimize production, planning, and stock in the supply chain, ensuring that they produce the right items at the right moment and improve the customer experience.
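The idea of measuring hypothetical decisions before taking them can be sketched as follows. The demand scenarios, their probabilities, the price, and the cost are all invented for illustration. The code evaluates each candidate decision against every scenario and recommends the one with the highest expected profit.

```python
# Hypothetical demand scenarios: units demanded -> probability
demand_scenarios = {80: 0.3, 100: 0.5, 120: 0.2}
PRICE, COST = 10.0, 6.0  # assumed selling price and unit cost

def expected_profit(quantity):
    """Average profit of producing `quantity` units across all scenarios."""
    return sum(
        probability * (PRICE * min(quantity, demand) - COST * quantity)
        for demand, probability in demand_scenarios.items()
    )

candidates = [80, 100, 120]  # the decisions we are willing to consider
for quantity in candidates:
    print(quantity, round(expected_profit(quantity), 2))

best_quantity = max(candidates, key=expected_profit)
print("Recommended production quantity:", best_quantity)
```

Producing too little leaves sales on the table and producing too much wastes stock, so the recommendation lands in between.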
2.8 Technical knowledge
Combining theory with actual technology is the most interesting part. Here’s some of the technical knowledge data scientists need to possess:
Among analysis tools, R is one of the most commonly used, designed particularly for statistical and data science work. R is a programming language used to compute statistics, interpret data, and display data graphically. Created in the 1990s by Ross Ihaka and Robert Gentleman, R was developed as an efficient statistical environment for data handling, cleaning, analysis, and presentation.
43% of data scientists use R, making it one of the leading tools for machine learning, statistics, and data processing. R makes it easy to create objects, functions, and packages. It’s open-source and platform-independent, so it doesn’t require a paid licence and can be used on any major operating system.
R can also integrate with other languages like C and C++, and leverage various statistical packages and data sources. Its statistical features include basic statistics, graphics, and probability, while its programming aspects involve distributed computing and R packages. Data scientists typically use this language in a program called RStudio.
One of the most valuable skills required for a career in data science is learning Python. 66% of data scientists stated that they are using Python every day, making it the analytics professionals’ number one language.
Many data science beginners start with Python. Since it is so flexible, Python is used for many different purposes.
Uses include Django web apps and websites, Flask microservices, general programming projects built on the standard library and PyPI packages, PyQt5 or Tkinter GUIs, and integration with code written in Java, C, and almost any other programming language. It can handle different data formats, and you can import SQL tables into your code with ease. It also enables you to create datasets, and you can find almost every sort of dataset you need online.
Of course, with the usual machine learning stack of Seaborn, Matplotlib, Pandas, and others, Python is also the primary language used for data science. The Anaconda distribution is one of the most popular ways to install Python together with these libraries.
Before starting to learn Python, remember that data science is only a small drop in the sea of Python uses. Don’t try to memorize everything immediately. You’ll learn the syntax as you encounter real-world problems and look for solutions. Start by learning basic programming concepts and continue with data science libraries, such as NumPy, Pandas, Matplotlib, Scikit-Learn, Seaborn, etc. Then, you can apply your knowledge to actual projects.
Sometimes, you can face a scenario where the data volume you have exceeds the system’s capacity or where you need to transfer data to various repositories. Apache’s Hadoop is a tool that solves this by enabling you to easily distribute data across the machines of a cluster. Also, it can be used for data discovery, filtration, sampling, and other processes.
Hadoop has several components:
- Hadoop Distributed File System (HDFS), which distributes and stores data across the cluster. Operations run where the data is stored, which reduces the need to transfer data over the network.
- MapReduce, which processes vast amounts of data in parallel across the nodes of the cluster.
- Yet Another Resource Negotiator (YARN), which is used for effective resource management.
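The MapReduce idea can be imitated on a single machine in plain Python. The sketch below is not Hadoop code and uses no Hadoop API. The two text chunks stand in for data blocks that would live on different nodes, and the function names are illustrative.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all values that belong to the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

# Each string stands in for a block of data stored on a different node.
chunks = ["big data needs big tools", "data tools for big clusters"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

In a real cluster, the map step runs on the machines that hold each block, and only the much smaller intermediate results travel over the network.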
Although it isn’t always necessary, getting familiar with Hadoop can be a very useful skill. An analysis performed by CrowdFlower on 3490 data science job ads on LinkedIn listed Apache Hadoop as the second most critical skill for a data scientist.
Data scientists use R or Python to work with DataFrames. However, massive volumes of data often cannot be loaded completely into a DataFrame or even stored in a .csv file. Instead, they are kept in large databases, such as SQL databases.
SQL (Structured Query Language) is a language used in relational databases to query and manipulate data. Relational Database Management Systems (RDBMS) organize data into two-dimensional tables, similar to datasets or Excel spreadsheets.
SQL is used for adding, removing, updating, and modifying data, but it isn’t designed for writing complete programs. It can help data scientists obtain data from databases, carry out analytical functions, and modify database structures.
As a data scientist, you need to be skilled in SQL. SQL is structured specifically to help you navigate, connect, and operate with data. When you use it to query a database, it provides you with valuable observations. It has straightforward statements that save you much of the programming you would normally need to run complex queries.
Mastering SQL can help you better understand database systems and improve your competency.
One of the key benefits of using SQL is that queries run directly inside the database where the data is stored, which significantly accelerates workflows.
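Here is a short sketch of these operations, using the sqlite3 module that ships with Python and an in-memory database. The orders table and its contents are made up for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create a table and add some rows.
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 150.0), ("carol", 90.0)],
)

# Update an existing row.
conn.execute("UPDATE orders SET amount = 95.0 WHERE customer = 'carol'")

# Query: total spend per customer, highest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
print(rows)

conn.close()
```

The grouping and summing happen inside the database engine, so Python only receives the three summary rows instead of every order.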
Apache Spark is one of the most popular big data technologies. It is a data computing platform that consists of a set of libraries for parallel data processing on computer clusters.
Spark supports many commonly used programming languages, like Python, R, Java, and Scala, and provides libraries for different functions, from SQL to machine learning. It can run on a single computer as well as on many servers. This is what makes it one of the best choices for processing vast amounts of data.
Spark is designed to support a wide variety of data analytics activities, ranging from simple data loading and SQL queries to machine learning and streaming. It handles all of these tasks with a single set of APIs, whether it runs on one machine or on a cluster of machines.
Even though complex tasks require the combination of various libraries, Spark makes these processes much easier.
It supports many types of storage systems like Azure Storage and Amazon S3, as well as Apache Kafka, Apache Hadoop, and Apache Cassandra. The biggest benefit of this is that data scientists can combine various storage systems to optimize the way they’re saving data.
Spark has its own SQL libraries. Moreover, it provides machine learning, stream processing, and graph analytics libraries. Beyond these, there are hundreds of external open-source libraries, ranging from connectors for different storage systems to machine learning algorithms.
Apache Spark also helps data scientists avoid data loss. Its power lies in its speed and infrastructure, making data science tasks easier to handle.
2.9 Domain knowledge
Domain knowledge is the experience and understanding of a specific area. In data science, it means understanding the world to which the data belongs.
Now that almost all industries need data scientists, domain knowledge is essential, and it can only be mastered progressively over time. This knowledge includes expertise in areas such as healthcare, finance, HR, insurance, media, energy, etc.
Without a sufficient understanding of the area from which the data originates, you cannot unlock the full power of an algorithm. It will simply be too difficult for you to solve a problem you know nothing about.
A high degree of experience in the field will greatly improve the accuracy of the model you want to create. Of course, data scientists aren’t expected to become experts in everything, but they usually build up domain knowledge in the several fields they’ve worked in.
LESSON 3: The applications of data science
3.1 Which industries use data science?
Data science can be used for:
- Detecting fraud
- Automating processes and decision-making
- Predicting outcomes
- Identifying trends
- Facial, voice, and text recognition
- Generating recommendations about products, social media content, books, etc.
In healthcare, data science has brought many transformations. Medical practitioners are exploring new ways to understand illnesses, practice preventive medicine, detect diseases more efficiently, and test new care options with a large network of data now accessible through EMRs, databases, or even wearable fitness devices.
Self-driving vehicle manufacturers like Tesla and Volkswagen leverage data science in their processes of collecting vast amounts of data through sensors to mimic the behaviour of a human driver. They use techniques like machine learning and predictive analytics to, for example, predict traffic jams and take the passenger through the safest route.
Streaming services also leverage data science to show you content that you are most likely to play. That’s how you get your own personal feed on Spotify or Netflix.
In fintech, data science is leveraged as a tool to detect fraud and keep transactions safe. Companies like PayPal and Stripe invest a lot in their data science processes to improve their services.
Data science is also very useful in discovering cybercrime. For example, cybersecurity company Kaspersky utilizes data science to identify a large number of malware samples every day.
E-commerce companies also use data science. It helps them improve customer targeting, identify potential customers, and customize product recommendations, all activities which contribute to increased sales volumes.
In the manufacturing industry, data science helps operators monitor performance, implement predictive maintenance, predict demand, optimize supply chains, automate processes, etc.
3.2 What companies use data science?
From healthcare to football and food delivery services, here are some of the most popular uses of data science:
Google: detecting metastatic breast cancer
90% of breast cancer deaths worldwide occur due to metastasis. But researchers at the Naval Medical Center San Diego and Google AI, a division within Google devoted to artificial intelligence (AI) research, have developed a promising approach that uses cancer-detecting algorithms to independently review biopsies of lymph nodes.
Their AI-based method, called Lymph Node Assistant, or LYNA, was announced through a paper published in the American Journal of Surgical Pathology titled “Artificial Intelligence-Based Breast Cancer Nodal Metastasis Detection.” The tests showed an area under the receiver operating characteristic curve (AUC), an indicator of detection accuracy, of 99%.
LYNA is based on Inception-v3, an open-source deep learning model for image recognition that has reportedly achieved an accuracy of over 78.1% on the ImageNet dataset from Stanford. According to the researchers, the model takes a 299-pixel image as input, which is the default input size of Inception-v3, outlines tumours at the pixel level, and extracts labels of the tissue patch during training, adjusting the model’s weights to minimize error.
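To get a feel for what an AUC value means, the sketch below computes it on a few made-up scores. This is not LYNA’s evaluation code. It only shows one standard way to read the metric: the share of (positive, negative) pairs in which the model gives the positive sample the higher score.

```python
def auc(positive_scores, negative_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical model scores (higher means "more likely to contain a tumour").
tumour_scores = [0.9, 0.8, 0.4]    # slides that really contain a tumour
healthy_scores = [0.7, 0.3, 0.2]   # slides that are really healthy

auc_value = auc(tumour_scores, healthy_scores)
print(f"AUC: {auc_value:.3f}")
```

An AUC of 0.5 means the model ranks no better than chance, while a value close to 1 means it almost always scores a tumour slide above a healthy one.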
Clue: predicting menstrual cycles
The Clue app can predict menstrual cycles and help female users improve their reproductive health. The app leverages data science by collecting user data such as period dates, moods, skin conditions, hair conditions, and other data points to predict the starting date of the next period and provide health recommendations.
Behind the scenes, data scientists explore this wealth of anonymized data with tools such as Python and Jupyter Notebook. The algorithm notifies users when they are fertile or at an increased risk for conditions such as an ectopic pregnancy.
Uber Eats: delivering fresh and hot food
What Uber Eats wants is to deliver hot food to customers’ doors. Although it sounds simple, in reality, this requires a lot of work. Uber Eats leverages machine learning, sophisticated statistical models and staff meteorologists. The team needs to foresee how any potential aspect, from floods to holiday rushes, will affect traffic and cooking time to optimize the complete distribution process.
The processes are supported by Uber’s in-house machine learning platform called Michelangelo. The platform helps data scientists and engineers solve machine learning challenges faster. It offers generic solutions for data analysis, feature engineering, and modelling, supporting both offline and online predictions.
Liverpool: the data science-backed Premier League champion
Liverpool’s data scientist Ian Graham built a mathematical model that measures how every pass, sprint, and goal attempt influences a team’s ultimate probability of winning. The football team started using this model in their playing strategy and, most importantly, to identify new players that could contribute to its success.
Thanks to this model, Liverpool brought some undervalued players on board and made a great team. The goal was to find potential stars undiscovered by teams with bigger budgets, such as Manchester United. Thanks to this strategy, the team won the 2019/20 Premier League title.
LESSON 4: Data science career outlook
People who believe that data science is a trend that’s going away soon are very wrong. On the contrary, we’ve only scratched the surface of data science. In fact, the US Bureau of Labour Statistics predicts that the rising demand for data science will open 11.5 million new job posts by 2026. The World Economic Forum also agrees, stating that being a data scientist will become the most promising position by 2022.
You’re still in time because the world needs more data scientists. The profession has still not become mainstream, and the competition isn’t very harsh. We believe that the number of data scientists globally is yet to grow.
If you were ever interested in a particular industry, data science could allow you to work anywhere you want to. The possibilities are endless. Data science is being widely used by sectors like healthcare, banking, consulting, and e-commerce. Thanks to its flexibility, you will get the ability to work in different sectors throughout your career.
Besides, you’ll get to work on inspiring projects that make people’s lives easier. By automating repetitive tasks, data science cuts down on boring work. Businesses use past data to train machines to perform routine tasks, which has reduced the tedious manual work traditionally carried out by people.
Finally, data scientists are among the people who earn the most. Glassdoor data shows that the average annual pay of a data scientist is $113,309.
4.1 What should your background be?
There are no strict rules here. You could be a philosopher or an archaeologist. Some data scientists hold master’s and PhD degrees. On the other hand, some of the best industry professionals are self-learners.
If you concentrate on building the skills you need to resolve the challenges you are going to tackle and the obstacles you need to overcome, your background is the least important thing. Any knowledge of coding and statistics would be beneficial to have, but not essential to get started.
The motivation to learn continuously is one thing that is definitely a must. If you don’t devote enough time and attention to discovering the mysteries of the field and becoming an excellent data scientist, you won’t be able to succeed.
4.2 The soft skills of a data scientist
Here are some soft skills that can help you become a better data scientist:
First of all, you need to be curious. You should always remain hungry for new knowledge. There are so many areas to be studied and so many data points to be explored that a data scientist has to have an intense desire to look for solutions.
Secondly, you’ve got to be organized. You’ll have to manage various tools when you’re working with large databases with multiple data points. At the end of your research, active organization and management will take you to the right conclusions.
You also need to be very patient and persistent. As your job will be to solve problems, you’ll probably wander a lot before getting to the right solution. That’s why you need to be patient and not to give up easily. Getting to the foundation of the problem might sometimes require a lot of effort.
When you’re a data scientist, remaining focused, creative, and detail-oriented is very helpful. Most importantly, you should like working with data. As it will take up a major part of your day, you’ll have to get comfortable with it. Do you estimate the chance of different scenarios in your life often? Do you rely on facts instead of intuition? If yes, then data science is the right job for you.
4.3 Data science careers
These are the career paths you could take after learning data science:
- Data scientist. They gather, clean, and analyze data. To detect patterns that can benefit an enterprise and help make rational business decisions, data scientists need to be able to analyze large volumes of structured and unstructured data.
- Data analyst. They transform and use massive datasets to support the analyses the company needs. They also monitor website metrics and perform A/B testing. Their expertise powers the decision-making process by generating reports that accurately explain the trends and insights from their studies.
- Machine learning engineer. They create data pipelines and provide technical solutions. They possess outstanding statistical and coding skills, as well as technical expertise to complete these assignments.
- Data engineer. They create and maintain data infrastructure. This role requires stronger coding skills, as you build data pipelines for various company departments. You are expected to develop a connected and accessible data system.
- Business intelligence (BI) analyst. BI professionals analyze information to detect trends and opportunities in the industry. They use different BI tools to turn data into actionable information that contributes to smarter decisions within the company.
- Statistician. To facilitate the corporate decision-making process, they collect, interpret, and distribute data to all the related stakeholders.
The world is constantly evolving. This opens up opportunities for data science to develop further, cope with vast quantities of data, and make end customers happy.
Data science provides countless career opportunities. For improved consumer service, multinational enterprises are always processing data and transforming it. To get the best results, important sectors such as banking, healthcare industries, travel, e-commerce platforms, and many other sectors leverage data science.
The world currently needs more than 250,000 data scientists. There has never been a better time to become a data scientist. We hope that our course has given you an overview of what it looks like to be a data scientist and that you will embark on this exciting journey together with us.