What is Data Science?
Data Science is an interdisciplinary field that combines statistics, computer science, and domain expertise to extract meaningful insights and knowledge from structured and unstructured data.
Why is Data Science important?
Data Science helps businesses make data-driven decisions, optimize processes, and predict future trends. It is used in various industries like healthcare, finance, entertainment, and transportation.
Can you give an example of a real-world application of Data Science?
Netflix uses Data Science to recommend shows and movies based on user viewing history and preferences.
What are the key steps in the Data Science workflow?
The key steps are:
- Problem Definition
- Data Collection
- Data Cleaning
- Exploratory Data Analysis (EDA)
- Modeling
- Deployment
What tools and technologies are commonly used in Data Science?
Common tools include Python (with libraries like Pandas, NumPy, Matplotlib, and Scikit-learn), R, Tableau, Power BI, Hadoop, Apache Spark, SQL, and NoSQL databases.
What are the different types of data?
The three main types of data are:
- Structured Data: Organized data with a clear format (e.g., SQL databases).
- Unstructured Data: Data without a predefined format (e.g., text, images, videos).
- Semi-Structured Data: Data that does not conform to a rigid structure but has some organizational properties (e.g., JSON, XML).
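To make the semi-structured case concrete, here is a minimal sketch using Python's standard json module. The records and field names are invented for illustration. Both records are valid JSON, but they do not share a fixed set of fields, so they must be aligned to common columns before they can be treated like a table.

```python
import json

# Hypothetical records: valid JSON, but without a shared, fixed schema.
raw = '''
[
  {"id": 1, "name": "Ada", "tags": ["admin", "dev"]},
  {"id": 2, "name": "Lin", "address": {"city": "Pune"}}
]
'''

records = json.loads(raw)

# Collect every key that appears in at least one record.
all_keys = sorted({key for record in records for key in record})

# Align each record to the same columns, filling absent fields with None.
rows = [{key: record.get(key) for key in all_keys} for record in records]

for row in rows:
    print(row)
```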
What are some common methods of data collection?
Common methods include surveys, APIs, and web scraping.
How do you handle missing data in a dataset?
Missing data can be handled by:
- Removing rows or columns with missing values.
- Imputing missing values using mean, median, or mode.
- Using advanced techniques like K-Nearest Neighbors (KNN) or regression to predict missing values.
What is data normalization, and why is it important?
Data normalization rescales features to a common range, typically 0 to 1 (min-max scaling). It is important because features measured on larger scales would otherwise dominate distance calculations and gradient updates, which affects scale-sensitive algorithms such as KNN and models trained with gradient descent.
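As an illustration, min-max scaling can be sketched in a few lines of NumPy. The function name min_max_scale and the sample features are assumptions made for this example. Mapping constant columns to 0 is one possible convention, not the only one.

```python
import numpy as np

def min_max_scale(values):
    """Rescale each column of a 2-D array to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Avoid division by zero: a constant column has range 0 and maps to 0.
    col_range[col_range == 0] = 1.0
    return (values - col_min) / col_range

# Hypothetical features: age in years and income in dollars.
X = np.array([[25, 40_000], [35, 60_000], [45, 100_000]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

After scaling, both columns span 0 to 1, so income no longer outweighs age simply because its raw numbers are larger.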
Write Python code to handle missing data in a Pandas DataFrame.
import pandas as pd
import numpy as np
# Sample DataFrame
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [10, 11, 12, 13]}
df = pd.DataFrame(data)
# Fill missing values in each column with that column's mean
df = df.fillna(df.mean())
print(df)
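Mean imputation is only one of the options listed earlier. The following sketch shows two others on the same sample data: dropping incomplete rows and filling with the median. The variable names dropped and median_filled are chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 11, 12, 13],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill each column's gaps with that column's median,
# which is less sensitive to extreme values than the mean.
median_filled = df.fillna(df.median())

print(dropped)
print(median_filled)
```

Dropping rows is simple but discards half of this small dataset, while median filling keeps every row.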
What is the purpose of EDA in Data Science?
EDA helps in understanding the data, identifying patterns, detecting outliers, and forming hypotheses. It is a crucial step before building models.
What are the measures of central tendency?
The measures of central tendency are:
- Mean: The average of all values.
- Median: The middle value when the data is sorted (or the average of the two middle values when the count is even).
- Mode: The most frequently occurring value.
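The three measures can be computed with Python's standard statistics module. The sample values below are made up and include one large value to show how the mean is pulled upward while the median and mode are not.

```python
import statistics

# Hypothetical sample with one large value (40).
values = [2, 3, 3, 5, 7, 40]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # average of the two middle values (even count)
mode = statistics.mode(values)      # most frequently occurring value

print(mean, median, mode)
```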
What is the difference between variance and standard deviation?
Variance is the average squared deviation of data points from the mean, so it is expressed in squared units. Standard deviation is the square root of the variance, which gives a measure of spread in the same units as the data.
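A small worked example, using the standard statistics module and invented height measurements, shows the relationship between the two. It uses the population formulas, which divide by n.

```python
import statistics

# Hypothetical measurements in centimetres.
heights_cm = [150, 160, 170, 180, 190]

# Population variance: mean of squared deviations from the mean (units: cm^2).
variance = statistics.pvariance(heights_cm)

# Population standard deviation: square root of the variance (units: cm).
std_dev = statistics.pstdev(heights_cm)

print(variance, std_dev)
```

For a sample rather than a full population, statistics.variance and statistics.stdev divide by n - 1 instead of n.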
Write Python code to create a histogram using Matplotlib.
import matplotlib.pyplot as plt
import numpy as np
# Sample data: 1,000 values drawn from a normal distribution (mean 100, standard deviation 15)
data = np.random.normal(100, 15, 1000)
# Create histogram
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram of Data')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
How do you identify outliers in a dataset?
Outliers can be identified using:
- Box Plots: Data points outside the whiskers (conventionally 1.5 × IQR beyond the quartiles) are flagged as potential outliers.
- Z-Score: Data points with a Z-score greater than 3 or less than -3 are commonly treated as outliers.
- IQR (Interquartile Range): Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.
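The Z-score and IQR rules above can be sketched with NumPy as follows. The readings are invented for illustration, and the cutoffs of 3 and 1.5 are the conventional defaults rather than fixed requirements.

```python
import numpy as np

# Hypothetical data: 19 ordinary readings between 10 and 14, plus one extreme value.
data = np.array([10, 12, 11, 13, 12, 11, 10, 14, 12, 13,
                 11, 12, 10, 13, 12, 11, 14, 12, 13, 95], dtype=float)

# Z-score rule: distance from the mean in units of standard deviation.
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]

print(z_outliers)
print(iqr_outliers)
```

Both rules flag the value 95 here. In very small samples the Z-score rule can miss an extreme value, because that value inflates the standard deviation it is measured against.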