# Data science

**Data science** is the application of statistics, computer science, and the scientific method to the practice of data analysis to convert data into information, with an emphasis on making accurate predictions.

## Data science skills

Data science practices may overlap significantly with business intelligence, data engineering, and data analysis practices. However, the primary focus of data science is to apply machine learning and statistical methods to data.

Broadly speaking, a data scientist has at least foundational knowledge of science, statistics, data analysis, and programming. Depending on the individual, the skill set may be weighted heavily toward one or two of these skills rather than evenly balanced among all four.

The basic activities of data science fall into three groups: collecting, cleaning, and transforming data to create descriptive statistics and visualizations that help practitioners understand and communicate the data and its overall quality; building statistical models to support statistical inference, hypothesis testing, and predictions or projections; and using machine learning to automate decision making and predictions.
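As an illustration of the first group of activities, the following is a minimal sketch of cleaning, transforming, and summarizing a small dataset using only the Python standard library. The records, field names (`city`, `temp_f`, `temp_c`), and helper functions are hypothetical and chosen only for this example; a real project would more likely use a library such as pandas.

```python
from statistics import mean, median, stdev

# Hypothetical raw records; an empty string stands in for a missing reading.
raw_rows = [
    {"city": " Boston ", "temp_f": "68.0"},
    {"city": "boston", "temp_f": "71.6"},
    {"city": "Denver", "temp_f": ""},
    {"city": "DENVER", "temp_f": "59.0"},
    {"city": "Denver", "temp_f": "62.6"},
]


def clean(rows):
    """Drop rows with missing readings and standardize text and types."""
    cleaned = []
    for row in rows:
        if not row["temp_f"]:
            continue  # skip rows with no temperature reading
        cleaned.append(
            {"city": row["city"].strip().title(), "temp_f": float(row["temp_f"])}
        )
    return cleaned


def to_celsius(rows):
    """Transform Fahrenheit readings to Celsius, rounded to one decimal."""
    return [
        {"city": r["city"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}
        for r in rows
    ]


def describe(values):
    """Compute a few descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


rows = to_celsius(clean(raw_rows))
summary = describe([r["temp_c"] for r in rows])
print(summary)
```

The `count` in the summary doubles as a simple data quality check: comparing it with the number of raw rows shows how many records were dropped during cleaning.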

According to the O'Reilly 2021 Data/AI Salary Survey, the most popular programming languages for data science are Python (61% of surveyed data scientists), SQL (54%), and JavaScript (32%). The most popular machine learning packages are scikit-learn (27% of surveyed data scientists), TensorFlow (20%), and PyTorch (19%), all of which are available as Python libraries.

## Data science process

The data science process starts with a question, which can take the form of a hypothesis to be tested, a decision to be made, or an outcome to be predicted. Data is then collected or, in some cases, created through experimentation. The collected data is then prepared through data wrangling. Next, a data model is built; this can be a numerical, statistical, or machine learning model that helps to analyze evidence to validate or invalidate a hypothesis, support a decision, or predict an outcome. The model is then evaluated for accuracy and, once validated, deployed and put into formal use.
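The modeling and evaluation steps of this process can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the observations (hours studied versus exam score) are made up, the model is a simple least-squares line, and the helper names `split`, `fit_line`, and `mean_absolute_error` are hypothetical. In practice a library such as scikit-learn would typically handle these steps.

```python
# Hypothetical observations for the question
# "Can hours studied predict an exam score?": (hours_studied, exam_score)
observations = [
    (1, 52), (2, 55), (3, 61), (4, 64),
    (5, 70), (6, 73), (7, 79), (8, 82),
]


def split(data, test_every=4):
    """Hold out every `test_every`-th observation for evaluation."""
    train = [p for i, p in enumerate(data) if (i + 1) % test_every != 0]
    test = [p for i, p in enumerate(data) if (i + 1) % test_every == 0]
    return train, test


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, points):
    """Average absolute difference between predictions and actual values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


train, test = split(observations)          # prepare: separate held-out data
model = fit_line(train)                    # model: fit on training data only
error = mean_absolute_error(model, test)   # evaluate: score on unseen data
print(f"slope={model[0]:.2f}, intercept={model[1]:.2f}, MAE={error:.2f}")
```

Evaluating on observations the model never saw during fitting is what makes the accuracy estimate meaningful; scoring on the training data alone would tend to overstate how well the model predicts new outcomes.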

Generally, these steps are followed iteratively rather than strictly in sequence, with each step repeated as needed to fully develop the model before it goes into production. Even after an initial production deployment, models are usually still iterated upon, improved, and redeployed.

## Deeper Knowledge on Data Science

### Data Products

Ways of making data available

### Data Teams

The makeup and measures of effective data teams

### Data Wrangling

Transforming "raw" data into a more easily analyzed form through normalization and format standardization

### Machine Learning (ML)

Machine learning terms, processes, and methods

### List of Public and Open Datasets

A list of freely available datasets for use in analytics and machine learning

### PySpark Recipes for Data Cleansing, Analysis, and Science

Recipes for using PySpark

### Python Open-Source Data Libraries

Python libraries commonly used in data science and analysis

## Broader Topics Related to Data Science

### Business Intelligence

Methods to bridge the gap between data and business

### Computer Science

The study of algorithms, data structures, information, and computation

### Data Analysis

The transformation of data to information

### Data

Facts, statistics, and references to information

### Statistics

The analysis of numerical data