Data science

Data science is the application of statistics, computer science, and the scientific method to the practice of data analysis to convert data into information, with an emphasis on making accurate predictions.

Data science skills

Data science practices may overlap significantly with business intelligence, data engineering, and data analysis practices. However, the primary focus of data science is to apply machine learning and statistical methods to data.

Broadly speaking, a data scientist has at least foundational knowledge of science, statistics, data analysis, and programming. Depending on the individual, the skill set may be weighted heavily toward one or two of these areas rather than evenly balanced among all four.

The basic activities of data science fall into three groups. The first is to collect, clean, and transform data to create descriptive statistics and visualizations that help in understanding and communicating the data and its overall quality. The second is to build statistical models that support statistical inference, hypothesis testing, and predictions or projections. The third is to use machine learning to automate decision making and predictions.
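The first of these activities, cleaning raw records and summarizing them, can be sketched with the Python standard library alone. This is a minimal illustration rather than a recommended toolchain: the sample records, the `age` field, and the helper names `clean` and `describe` are all hypothetical.

```python
import statistics

# Hypothetical raw records: some ages are missing or malformed.
raw_records = [
    {"name": "a", "age": "34"},
    {"name": "b", "age": ""},
    {"name": "c", "age": "29"},
    {"name": "d", "age": "n/a"},
    {"name": "e", "age": "41"},
    {"name": "f", "age": "36"},
]

def clean(records, field):
    """Return the numeric values of `field`, skipping entries that fail to parse."""
    values = []
    for record in records:
        try:
            values.append(float(record[field]))
        except (KeyError, ValueError):
            continue  # missing or non-numeric entry: drop it
    return values

def describe(values):
    """Summarize a list of numbers with basic descriptive statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

ages = clean(raw_records, "age")
summary = describe(ages)
# The number of dropped records is a crude indicator of data quality.
summary["dropped"] = len(raw_records) - len(ages)
print(summary)
```

Reporting how many records were dropped alongside the statistics keeps the data-quality question visible rather than hiding it inside the cleaning step.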

According to the O'Reilly 2021 Data/AI Salary Survey, the most popular programming languages for data science are Python (61% of surveyed data scientists), SQL (54%), and JavaScript (32%). The most popular machine learning packages are scikit-learn (27% of surveyed data scientists), TensorFlow (20%), and PyTorch (19%), all of which are Python libraries.

Data science process

The data science process starts with a question, which can take the form of a hypothesis to be tested, a decision to be made, or a prediction to be made. Data is then collected or, in some cases, created through experimentation. The collected data is then prepared through data wrangling. Next, a model is prepared; this can be a numerical, statistical, or machine learning model that helps analyze evidence to validate or invalidate a hypothesis, support a decision, or predict an outcome. The model is then evaluated for accuracy and, once validated, deployed and put into formal use.
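The steps above can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions, not a prescribed workflow: the data is synthetic, the question ("can y be predicted from x?") is made up, and the helper `fit_line` and the 80/20 split are hypothetical choices.

```python
import random
import statistics

# 1. Question (hypothetical): can y be predicted from x?

# 2. Collect: synthetic data stands in for collected data.
rng = random.Random(0)
raw = [(x, 3.0 * x + 2.0 + rng.gauss(0, 1.0)) for x in range(50)]
raw.append((None, 10.0))  # a deliberately incomplete record

# 3. Wrangle: drop incomplete records, then shuffle and split 80/20.
data = [(x, y) for x, y in raw if x is not None and y is not None]
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# 4. Model: ordinary least squares for y = slope * x + intercept.
def fit_line(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)

# 5. Evaluate: mean absolute error on data the model has not seen.
errors = [abs(y - (slope * x + intercept)) for x, y in test]
mae = statistics.mean(errors)
print(f"slope={slope:.2f} intercept={intercept:.2f} held-out MAE={mae:.2f}")
```

Evaluating on held-out records rather than on the training data is what makes the accuracy estimate meaningful; whether the resulting error is good enough to deploy depends on the original question.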

Generally, these steps are followed iteratively and non-sequentially, with each step repeated as needed to fully develop the model before it goes into production. Even after an initial production deployment, models are usually still iterated upon, improved, and redeployed.

Deeper Knowledge on Data Science

Machine Learning (ML)

List of Public and Open Datasets

Python Open-Source Data Libraries

Data Wrangling

Data Teams

Data Products

Broader Topics Related to Data Science

Business Intelligence

Data Analysis

Statistics

Data

Computer Science
