Top 36 Data Science Coding Interview Questions and Answers
Data science has become one of the most transformative fields, helping companies make more informed and profitable business decisions. As a result, almost every tech and non-tech company hires data scientists to examine and draw insights from large datasets. Since the job market is highly competitive, it is important to prepare well to secure a job at the company of your choice. That is why we have curated a range of commonly asked data science coding interview questions. The guide covers everything from basic principles to advanced concepts. With the help of these coding interview questions, you can highlight your technical expertise and make a lasting impression on your interviewers.
Data Science Coding Interview Questions for Beginners
Data science coding interview questions for beginners primarily focus on academic projects and fundamental concepts. Whichever company you apply to, research online resources to learn which data science interview questions it commonly asks. Here are some entry-level coding questions that can help you highlight your technical skills.
Q1. How can you create a function in Python to reverse a string?
Sample Answer: You can reverse a string in Python using the slicing feature. The slicing notation s[::-1] starts from the end of the string and moves to the beginning, effectively reversing it. This method is both concise and efficient. Here’s an example:
def reverse_string(s):
return s[::-1]
Q2. What distinguishes a list from a tuple in Python?
Sample Answer: The primary distinction between a list and a tuple in Python is mutability. A list is mutable, meaning its contents can be changed after creation. For instance:
my_list = [1, 2, 3]
my_list.append(4) # Now my_list is [1, 2, 3, 4]
Conversely, a tuple is immutable. Once created, its contents cannot be altered. Tuples are defined using parentheses:
my_tuple = (1, 2, 3)
# Attempting to modify it like my_tuple.append(4) would raise an error.
Choosing between them depends on whether you need to modify the data. Tuples can also be slightly faster and are often used when data should remain constant.
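Because tuples are hashable, they can also serve as dictionary keys, which lists cannot; here is a quick illustrative sketch:
coords = {}
coords[(40.7, -74.0)] = "New York"  # A tuple is hashable, so it can be a dictionary key
print(coords[(40.7, -74.0)])        # Output: New York
# coords[[40.7, -74.0]] = "New York"  # Would raise TypeError: unhashable type: 'list'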
Q3. Can you write a function to determine if a number is prime?
Sample Answer: To check if a number is prime, you need to verify that it is only divisible by 1 and itself. Here’s a simple function for this:
def is_prime(n):
if n <= 1:
return False
for i in range(2, int(n**0.5) + 1):
if n % i == 0:
return False
return True
This function first checks if the number is less than or equal to 1 (which is not prime). It then tests divisibility from 2 up to the square root of the number.
Q4. What is the difference between == and is in Python?
Sample Answer: In Python, == checks for value equality, meaning it verifies if two variables hold the same value. For example:
a = [1, 2, 3]
b = [1, 2, 3]
print(a == b) # True, because their values are identical.
On the other hand, is checks for identity, meaning it determines if two variables point to the same object in memory:
print(a is b) # False, because they are different objects.
c = a
print(a is c) # True, because c refers to the same object as a.
This distinction is crucial when working with mutable objects like lists.
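Note that is should never be used for value comparison; a short sketch of the common pitfall (the small-integer behavior is a CPython implementation detail):
p = [1, 2]
q = [1, 2]
print(p == q)  # True: the lists hold equal values
print(p is q)  # False: they are two distinct objects in memory
m = 256
n = 256
print(m is n)  # True in CPython, but only because small integers are cached (an implementation detail)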
Q5. How would you implement a function to calculate the factorial of a number?
Sample Answer: You can calculate the factorial of a number using either iteration or recursion. Here’s an example using iteration:
def factorial(n):
if n < 0:
return "Invalid input"
result = 1
for i in range(1, n + 1):
result *= i
return result
This function initializes the result to 1 and multiplies it by each integer up to n. It’s straightforward and avoids the recursion-depth limit that a recursive version can hit for very large inputs.
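For comparison, here is a minimal recursive sketch of the same calculation; it is concise but relies on Python’s recursion limit being large enough for the given n:
def factorial_recursive(n):
    if n < 0:
        return "Invalid input"
    if n in (0, 1):  # Base case: 0! = 1! = 1
        return 1
    return n * factorial_recursive(n - 1)

print(factorial_recursive(5))  # Output: 120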
Q6. What are generators in Python? Provide an example.
Sample Answer: Generators are a special type of iterator in Python that allow for lazy iteration over sequences of values. They generate values on-the-fly and use less memory. You create a generator using a function along with the yield keyword. Here’s an example:
def my_generator():
for i in range(1, 4):
yield i
gen = my_generator()
print(next(gen)) # Output: 1
print(next(gen)) # Output: 2
print(next(gen)) # Output: 3
Using yield instead of return enables the function to produce a series of values over time while pausing and resuming as needed. This feature is particularly useful for processing large datasets or streams of data.
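As a quick illustration of the memory benefit, a generator expression can aggregate over a large range without ever materializing it as a list:
# Sum ten million squares without building a ten-million-element list
total = sum(x * x for x in range(10_000_000))
print(total)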
Pro Tip: As you read our blog further, you will come across several data science coding job interview questions. The purpose of our blog is to help you ace the interview. But before you explore more data science-related coding topics and questions, read our guide on how to get a data science job and explore the best opportunities for your career.
Q7. Can you explain the differences between the map and filter functions in Python?
Sample Answer: Both map and filter are built-in functions used for functional programming in Python but serve different purposes. The map function applies a specified function to every item in an iterable and returns a new iterable with the results. For instance:
def square(x):
return x * x
numbers = [1, 2, 3, 4]
squared = map(square, numbers)
print(list(squared)) # Output: [1, 4, 9, 16]
Conversely, the filter function applies a specified function to all items in an iterable and returns only those items for which the function returns True.
def is_even(x):
return x % 2 == 0
numbers = [1, 2, 3, 4]
evens = filter(is_even, numbers)
print(list(evens)) # Output: [2, 4]
In short, map transforms every item, while filter selects the items that satisfy a condition. Both are powerful tools for concise data processing.
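For reference, the same results can be written as comprehensions, which are often preferred for readability; a small equivalent sketch:
numbers = [1, 2, 3, 4]
squared = [x * x for x in numbers]          # equivalent to map(square, numbers)
evens = [x for x in numbers if x % 2 == 0]  # equivalent to filter(is_even, numbers)
print(squared)  # Output: [1, 4, 9, 16]
print(evens)    # Output: [2, 4]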
Q8. How would you implement binary search in Python?
Sample Answer: Binary search is an efficient algorithm for finding an item in a sorted list by repeatedly halving the search interval. If the target is less than the middle item, the search narrows to the lower half; otherwise, it narrows to the upper half. Here’s how you can implement it:
def binary_search(arr, target):
left, right = 0, len(arr) - 1
while left <= right:
mid = (left + right) // 2
if arr[mid] == target:
return mid
elif arr[mid] < target:
left = mid + 1
else:
right = mid - 1
return -1 # Target not found
In this function, we initialize two pointers (left and right) at the start and end of the list respectively and repeatedly check the middle element while adjusting pointers based on comparisons with the target value.
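A quick usage check for the function above; the standard-library bisect module offers a related built-in for sorted sequences:
import bisect

sorted_nums = [2, 5, 8, 12, 16, 23, 38]
print(binary_search(sorted_nums, 23))       # Output: 5 (index of 23)
print(binary_search(sorted_nums, 7))        # Output: -1 (not found)
print(bisect.bisect_left(sorted_nums, 23))  # Output: 5 (standard-library insertion-point search)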
Q9. Can you explain how a hash table operates? Provide an example.
Sample Answer: A hash table is a data structure that stores key-value pairs using a hash function to compute an index into an array of buckets or slots from which desired values can be retrieved efficiently. The main advantage of hash tables lies in their average-case constant-time complexity (O(1)) for lookups, insertions, and deletions.
Here’s a simple example using Python’s dictionary (which functions as a hash table):
# Creating a hash table (dictionary)
hash_table = {}
# Adding key-value pairs
hash_table["name"] = "Alice"
hash_table["age"] = 25
hash_table["city"] = "New York"
# Retrieving values
print(hash_table["name"]) # Output: Alice
print(hash_table["age"]) # Output: 25
print(hash_table["city"]) # Output: New York
In this example, Python’s dictionary implementation handles the hashing implicitly: each key is hashed to compute the index where its value is stored, and any hash collisions are resolved internally.
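Two lookup patterns worth knowing alongside the example above: the in operator for membership tests and get() for safe retrieval with a default.
print("name" in hash_table)              # Output: True (membership test, O(1) on average)
print(hash_table.get("country", "N/A"))  # Output: N/A (missing key, so the default is returned)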
Q10. How would you implement bubble sort in Python?
Sample Answer: Bubble sort is a straightforward sorting algorithm that repeatedly steps through the list comparing adjacent elements and swapping them if they’re out of order. This process continues until no swaps are needed (the list is sorted).
Here’s how you can implement it:
def bubble_sort(arr):
n = len(arr)
for i in range(n):
for j in range(0, n - i - 1):
if arr[j] > arr[j + 1]:
arr[j], arr[j + 1] = arr[j + 1], arr[j]
# Example usage
arr = [64, 34, 25, 12, 22, 11, 90]
bubble_sort(arr)
print("Sorted array:", arr)
In this implementation, we use two nested loops. The inner loop performs comparisons and swaps while the outer loop ensures that this process repeats until the entire list is sorted.
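Since the answer notes that sorting stops once no swaps are needed, here is a common optimization sketch that adds an early-exit flag so already-sorted input finishes in a single pass:
def bubble_sort_optimized(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:  # No swaps in this pass: the list is already sorted
            break
    return arr

print(bubble_sort_optimized([3, 1, 2]))  # Output: [1, 2, 3]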
Q11. Explain and demonstrate the difference between list and dictionary comprehension with an example of converting a list of temperatures from Celsius to Fahrenheit.
Sample Answer: List comprehension allows you to create a new list by applying an expression to each item in an existing iterable (like a list). It’s a concise way to generate lists. On the other hand, dictionary comprehension is similar to list comprehension but creates a dictionary instead. It allows you to create key-value pairs from an existing iterable.
Given a list of Celsius temperatures, we can use both types of comprehension to create a new list of temperatures in Fahrenheit and a dictionary that maps each Celsius temperature to its corresponding Fahrenheit value.
Here’s an example of converting Celsius to Fahrenheit:
# List of Celsius temperatures
celsius = [0, 10, 20, 30, 40]
# List comprehension
fahrenheit_list = [((9/5) * temp + 32) for temp in celsius]
# Dictionary comprehension (celsius as key, fahrenheit as value)
fahrenheit_dict = {temp: ((9/5) * temp + 32) for temp in celsius}
print(fahrenheit_list) # [32.0, 50.0, 68.0, 86.0, 104.0]
print(fahrenheit_dict) # {0: 32.0, 10: 50.0, 20: 68.0, 30: 86.0, 40: 104.0}
Q12. Create a generator function that yields prime numbers up to a given limit. Explain why generators are memory efficient.
Sample Answer: Generators in Python are a convenient way to create iterators using the ‘yield’ keyword. They generate values one at a time and only when requested, which makes them memory efficient compared to lists or other data structures that store all values at once.
Here’s a generator function that yields prime numbers up to a given limit:
def prime_generator(limit):
def is_prime(n):
"""Check if a number is prime."""
if n < 2:
return False
for i in range(2, int(n ** 0.5) + 1):
if n % i == 0:
return False
return True
n = 2 # Start from the first prime number
while n < limit:
if is_prime(n):
yield n # Yield the prime number
n += 1
# Example usage
primes = prime_generator(20)
print(list(primes)) # Output: [2, 3, 5, 7, 11, 13, 17, 19]
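To make the memory argument concrete, sys.getsizeof shows that a generator object stays tiny regardless of how many values it can produce, unlike an equivalent list (sizes are approximate and vary by Python version):
import sys

as_list = [x for x in range(100_000)]
as_gen = (x for x in range(100_000))
print(sys.getsizeof(as_list))  # Roughly 800 KB: every value is stored up front
print(sys.getsizeof(as_gen))   # Roughly 100-200 bytes: values are produced lazily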
Pro Tip: Are you a recent computer science, mathematics, or statistics graduate and want to explore the data science field? You can start your professional career by applying for data science internships. Read our guide to data science internship interview questions to prepare well for the interview process and get an idea of what questions are asked.
Data Science Coding Interview Questions for Mid-level Professionals
As you grow in your career, you will naturally aim to acquire better knowledge and skills in your professional field. This section covers data science coding job interview questions for mid-level professionals to assess your understanding of foundational principles and problem-solving skills.
Q13. Given a DataFrame with daily stock prices, calculate the 7-day moving average and the daily percentage change.
Sample Answer: This involves the DataFrame methods rolling() for the moving average and pct_change() for the daily percentage change. Setting min_periods=1 keeps the moving average defined before a full 7-day window has accumulated, and the first value returned by pct_change() is NaN by definition.
import pandas as pd
import numpy as np
def analyze_stock_prices(df):
"""
df should have columns: 'date' and 'price'
"""
# Create copy to avoid modifying original
analysis_df = df.copy()
# Calculate 7-day moving average
analysis_df['7day_ma'] = df['price'].rolling(window=7, min_periods=1).mean()
# Calculate daily percentage change
analysis_df['daily_return'] = df['price'].pct_change() * 100
return analysis_df
# Example usage
dates = pd.date_range(start='2023-01-01', periods=10)
prices = [100, 102, 101, 103, 104, 103, 105, 106, 107, 108]
df = pd.DataFrame({'date': dates, 'price': prices})
result = analyze_stock_prices(df)
Q14. Write a function that handles missing values in a DataFrame based on the data type of each column.
Sample Answer: Different data types require different strategies for handling missing values. Numerical data might use mean/median, while categorical data might use mode or a special category.
The following implementation fills missing values in numerical columns with the median, in categorical columns with the mode, and uses forward fill for datetime columns.
def handle_missing_values(df):
"""
Handle missing values based on column type:
- Numeric: Fill with median
- Categorical: Fill with mode
- Datetime: Forward fill
"""
df_cleaned = df.copy()
for column in df_cleaned.columns:
# Get column type
dtype = df_cleaned[column].dtype
# Handle numeric columns
if np.issubdtype(dtype, np.number):
median_value = df_cleaned[column].median()
df_cleaned[column] = df_cleaned[column].fillna(median_value)
# Handle categorical columns
elif dtype == 'object' or dtype.name == 'category':
mode_value = df_cleaned[column].mode()[0]
df_cleaned[column] = df_cleaned[column].fillna(mode_value)
# Handle datetime
elif np.issubdtype(dtype, np.datetime64):
df_cleaned[column] = df_cleaned[column].ffill()
return df_cleaned
# Example usage
data = {
'numeric': [1, 2, np.nan, 4],
'categorical': ['A', 'B', np.nan, 'B'],
'datetime': pd.date_range('2023-01-01', periods=4)
}
df = pd.DataFrame(data)
df.loc[2, 'datetime'] = pd.NaT
cleaned_df = handle_missing_values(df)
Q15. Write a function to calculate the confidence interval for a mean, handling both normal and t-distributions based on sample size.
Sample Answer: Confidence intervals provide a range of values that likely contain the population mean. We use the normal distribution for large samples (n≥30). For smaller samples, we use the t-distribution. The function should handle both cases automatically.
The following code uses the normal distribution for larger samples and the t-distribution for smaller samples.
from scipy import stats
import numpy as np
def calculate_confidence_interval(data, confidence=0.95):
"""
Calculate confidence interval for the mean.
Automatically uses t-distribution for n<30, normal distribution for n≥30
"""
n = len(data) # Sample size
mean = np.mean(data) # Sample mean
std_error = stats.sem(data) # Standard error of the mean
# Choose distribution based on sample size
if n < 30:
# Use t-distribution
t_value = stats.t.ppf((1 + confidence) / 2, df=n-1) # Critical t-value
margin_error = t_value * std_error # Margin of error
else:
# Use normal distribution
z_value = stats.norm.ppf((1 + confidence) / 2) # Critical z-value
margin_error = z_value * std_error # Margin of error
# Return the confidence interval
return (mean - margin_error, mean + margin_error)
# Example usage
data = np.random.normal(loc=100, scale=15, size=25) # Sample data
ci_lower, ci_upper = calculate_confidence_interval(data) # Calculate CI
print(f"95% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f})")
Q16. Implement a function that performs k-fold cross-validation using scikit-learn’s Pipeline, including preprocessing steps.
Sample Answer: K-fold cross-validation is essential for evaluating the performance of a machine learning model while ensuring that the model is robust and not overfitting. By using a pipeline, we can integrate preprocessing steps directly into the cross-validation process, which helps prevent data leakage.
Here’s how to set it up:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
def create_model_pipeline(X, y, k_folds=5):
"""
Create and evaluate a pipeline with preprocessing and model training
"""
# Create a pipeline
pipeline = Pipeline([
('scaler', StandardScaler()), # Step to standardize features
('classifier', LogisticRegression()) # Step for the classifier
])
# Perform k-fold cross-validation
scores = cross_val_score(pipeline, X, y, cv=k_folds, scoring='accuracy')
# Calculate statistics
mean_score = scores.mean()
std_score = scores.std()
return {
'pipeline': pipeline,
'cv_scores': scores,
'mean_score': mean_score,
'std_score': std_score
}
# Example usage
from sklearn.datasets import make_classification
# Create a synthetic dataset for demonstration
X, y = make_classification(n_samples=1000, random_state=42)
# Create the model pipeline and evaluate it
results = create_model_pipeline(X, y)
# Display the results
print(f"Cross-validation accuracy: {results['mean_score']:.3f} ± {results['std_score']:.3f}")
Pro Tip: While preparing for data science coding job interview questions is important, you should explore companies that offer lucrative jobs in this field. Read our guide on the highest-paying data science companies and research their interview process to prepare well for the selection process.
Q17. Write a SQL query to find the top 3 departments with the highest average salary, but only include departments with at least 5 employees.
Sample Answer: To find the top 3 departments with the highest average salary, while including only those departments that have at least 5 employees, we will follow these steps:
- Group the data by department to calculate the average salary and count of employees.
- Filter out departments with fewer than 5 employees using the HAVING clause.
- Rank the departments based on their average salary.
- Retrieve the top 3 departments with the highest average salary.
Here’s how the SQL query looks:
WITH dept_stats AS (
SELECT
d.department_id,
d.department_name,
COUNT(e.employee_id) AS emp_count,
AVG(e.salary) AS avg_salary
FROM employees e
JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_id, d.department_name
HAVING COUNT(e.employee_id) >= 5
),
ranked_depts AS (
SELECT
department_name,
emp_count,
avg_salary,
RANK() OVER (ORDER BY avg_salary DESC) AS salary_rank
FROM dept_stats
)
SELECT
department_name,
emp_count,
ROUND(avg_salary, 2) AS average_salary
FROM ranked_depts
WHERE salary_rank <= 3
ORDER BY avg_salary DESC;
This query ranks the departments with at least 5 employees and outputs the top 3 based on average salary.
Q18. Create a custom iterator class that generates Fibonacci numbers up to a specified limit.
Sample Answer: In Python, custom iterators need to implement __iter__ and __next__ methods. The iterator should maintain its state and know when to stop. Here’s the code for creating a custom iterator class that generates Fibonacci numbers up to a specified limit:
class FibonacciIterator:
def __init__(self, limit):
self.limit = limit
self.previous = 0
self.current = 1
def __iter__(self):
return self
def __next__(self):
if self.previous > self.limit:
raise StopIteration
result = self.previous
self.previous, self.current = (
self.current,
self.previous + self.current
)
return result
# Example usage
fib = FibonacciIterator(100)
print(list(fib)) # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
Q19. Write a decorator that measures and logs the execution time of any function it decorates.
Sample Answer: In Python, decorators allow us to modify or enhance the behavior of functions or methods. The decorator below measures and logs the execution time of any function it wraps, which is useful for performance monitoring. The example uses Python’s time module.
import time
from functools import wraps
def measure_time(func):
"""
Decorator to measure the execution time of a function.
Parameters:
- func: The function to be wrapped.
Returns:
- wrapper: The wrapped function that logs execution time.
"""
@wraps(func)
def wrapper(*args, **kwargs):
start_time = time.time() # Record the start time
result = func(*args, **kwargs) # Call the original function
end_time = time.time() # Record the end time
execution_time = end_time - start_time # Calculate execution time
print(f"{func.__name__} took {execution_time:.4f} seconds") # Log execution time
return result # Return the result of the original function
return wrapper
# Example usage
@measure_time
def slow_function():
"""A sample function that simulates a delay."""
time.sleep(1) # Simulate a slow operation
return "Done"
# Call the decorated function
result = slow_function()
print(result) # Output the result
Q20. Write a function that resamples daily data to monthly averages and handles missing values appropriately.
Sample Answer: To analyze time series data effectively, resampling daily data to monthly averages can help identify trends and patterns. The code uses Pandas’ datetime functionality and the resample() method; the aggregation functions involved (such as mean) skip missing values by default, so gaps in the daily data are handled gracefully.
import pandas as pd
import numpy as np
def resample_to_monthly(df, date_column, value_column):
"""
Resample daily data to monthly averages, handling missing values appropriately.
Parameters:
- df: DataFrame containing the daily data.
- date_column: The name of the column that contains date values.
- value_column: The name of the column that contains the values to be averaged.
Returns:
- DataFrame containing monthly averages, counts, minimum, and maximum values.
"""
# Ensure the date column is in datetime format
df[date_column] = pd.to_datetime(df[date_column])
# Set the date column as the index for resampling
df_indexed = df.set_index(date_column)
# Resample the data to monthly frequency and calculate statistics
monthly_stats = df_indexed[value_column].resample('M').agg(
average='mean', # Calculate average for the month
count='count', # Count non-null values
min='min', # Minimum value for the month
max='max' # Maximum value for the month
)
# Reset index to return date as a column
monthly_stats = monthly_stats.reset_index()
return monthly_stats
# Example usage
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
values = np.random.normal(100, 10, len(dates))
# Introduce some random missing values
mask = np.random.choice([True, False], size=values.shape, p=[0.1, 0.9]) # 10% missing values
values[mask] = np.nan
df = pd.DataFrame({'date': dates, 'value': values})
# Resample the daily data to monthly averages
monthly_data = resample_to_monthly(df, 'date', 'value')
# Display the resulting monthly statistics
print(monthly_data)
Q21. Create a function that generates interaction features between numerical columns and creates dummy variables for categorical columns.
Sample Answer: Here’s a function that creates interaction features between numerical columns and generates dummy variables for categorical columns:
def engineer_features(df, numerical_cols, categorical_cols):
"""
Generate interaction features and dummy variables
"""
result_df = df.copy()
# Create interaction features
if len(numerical_cols) >= 2:
for i in range(len(numerical_cols)):
for j in range(i+1, len(numerical_cols)):
col1, col2 = numerical_cols[i], numerical_cols[j]
result_df[f'{col1}_{col2}_interaction'] = (
result_df[col1] * result_df[col2]
)
# Create dummy variables
for col in categorical_cols:
dummies = pd.get_dummies(
result_df[col],
prefix=col,
drop_first=True
)
result_df = pd.concat([result_df, dummies], axis=1)
result_df.drop(col, axis=1, inplace=True)
return result_df
# Example usage
data = {
'age': [25, 30, 35],
'income': [50000, 60000, 70000],
'education': ['HS', 'BS', 'MS'],
'location': ['urban', 'rural', 'urban']
}
df = pd.DataFrame(data)
engineered_df = engineer_features(
df,
numerical_cols=['age', 'income'],
categorical_cols=['education', 'location']
)
Q22. Implement a function to analyze A/B test results, including calculating p-values and confidence intervals for the conversion rates difference.
Sample Answer: Analyzing A/B test results is crucial for understanding the impact of changes on conversion rates. The code below calculates conversion rates, performs a z-test on the difference in proportions, and computes a confidence interval for that difference.
import numpy as np
import scipy.stats as stats
def analyze_ab_test(control_conversions, control_size,
treatment_conversions, treatment_size,
confidence_level=0.95):
"""
Analyze A/B test results, including conversion rates, p-values, and confidence intervals.
Parameters:
- control_conversions: Number of conversions in the control group.
- control_size: Total size of the control group.
- treatment_conversions: Number of conversions in the treatment group.
- treatment_size: Total size of the treatment group.
- confidence_level: Confidence level for the confidence interval.
Returns:
- Dictionary containing conversion rates, rate difference, p-value, and confidence interval.
"""
# Calculate conversion rates
control_rate = control_conversions / control_size
treatment_rate = treatment_conversions / treatment_size
# Calculate standard errors
control_se = np.sqrt(control_rate * (1 - control_rate) / control_size)
treatment_se = np.sqrt(treatment_rate * (1 - treatment_rate) / treatment_size)
# Calculate difference and combined standard error
rate_diff = treatment_rate - control_rate
combined_se = np.sqrt(control_se**2 + treatment_se**2)
# Calculate z-score and p-value for the difference in conversion rates
z_score = rate_diff / combined_se
p_value = 2 * (1 - stats.norm.cdf(abs(z_score))) # Two-tailed p-value
# Calculate critical z-value for the confidence interval
z_critical = stats.norm.ppf((1 + confidence_level) / 2)
ci_lower = rate_diff - z_critical * combined_se
ci_upper = rate_diff + z_critical * combined_se
return {
'control_rate': control_rate,
'treatment_rate': treatment_rate,
'rate_difference': rate_diff,
'p_value': p_value,
'confidence_interval': (ci_lower, ci_upper)
}
# Example usage
results = analyze_ab_test(
control_conversions=100,
control_size=1000,
treatment_conversions=120,
treatment_size=1000
)
# Display results
print("Control Conversion Rate:", results['control_rate'])
print("Treatment Conversion Rate:", results['treatment_rate'])
print("Rate Difference:", results['rate_difference'])
print("P-Value:", results['p_value'])
print("Confidence Interval:", results['confidence_interval'])
Q23. Implement a function that performs time series forecasting using SARIMA (Seasonal ARIMA) and evaluates the model’s performance.
Sample Answer: The Seasonal ARIMA (SARIMA) model is effective for time series data exhibiting seasonality. This function fits a SARIMA model to the data, generates forecasts, and evaluates model performance using error metrics and model quality indicators.
Here’s the code for the same:
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error
def sarima_forecast(data, order=(1,1,1), seasonal_order=(1,1,1,12)):
"""
Fit SARIMA model and generate forecasts
"""
# Fit model
model = SARIMAX(
data,
order=order,
seasonal_order=seasonal_order,
enforce_stationarity=False
)
results = model.fit()
# Generate forecasts
forecast = results.get_forecast(steps=12)
forecast_mean = forecast.predicted_mean
forecast_ci = forecast.conf_int()
# Calculate metrics
predictions = results.get_prediction(start=len(data)-12)
pred_mean = predictions.predicted_mean
metrics = {
'mae': mean_absolute_error(data[-12:], pred_mean[-12:]),
'rmse': np.sqrt(mean_squared_error(data[-12:], pred_mean[-12:])),
'aic': results.aic
}
return {
'model': results,
'forecast': forecast_mean,
'confidence_intervals': forecast_ci,
'metrics': metrics
}
# Example usage
import numpy as np
np.random.seed(42)
dates = pd.date_range(start='2020-01-01', end='2023-12-31', freq='M')
data = pd.Series(np.random.normal(0, 1, len(dates)), index=dates)
forecast_results = sarima_forecast(data)
Q24. Create a function that performs feature selection using mutual information.
Sample Answer: Mutual information is an effective method for feature selection because it measures the statistical dependence between each feature and the target, capturing both linear and non-linear relationships. This function uses mutual information to rank and select the most important features and generates a bar plot of their importance scores.
Here’s how you can create a function that performs feature selection using mutual information:
from sklearn.feature_selection import mutual_info_regression
import matplotlib.pyplot as plt
def select_features_mutual_info(X, y, n_features=5):
"""
Select features using mutual information
"""
# Calculate mutual information scores
mi_scores = mutual_info_regression(X, y)
# Create DataFrame with scores
feature_scores = pd.DataFrame({
'feature': X.columns,
'mi_score': mi_scores
})
# Sort features by importance
feature_scores = feature_scores.sort_values('mi_score', ascending=False)
# Plot feature importance
plt.figure(figsize=(10, 6))
plt.bar(range(len(mi_scores)), feature_scores['mi_score'])
plt.xticks(range(len(mi_scores)), feature_scores['feature'], rotation=45)
plt.title('Feature Importance (Mutual Information)')
plt.xlabel('Features')
plt.ylabel('Mutual Information Score')
plt.tight_layout()
# Select top features
selected_features = feature_scores.head(n_features)['feature'].tolist()
return {
'selected_features': selected_features,
'feature_scores': feature_scores,
'plot': plt.gcf()
}
# Example usage
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=10, random_state=42)
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
results = select_features_mutual_info(X_df, y)
Data Science Coding Interview Questions for Experienced Candidates
For more experienced roles, employers often look for candidates with an advanced understanding of technical principles and complex problem-solving skills. These are some of the advanced coding interview questions used to assess data scientists’ analytical and decision-making skills.
Q25. Create a comprehensive text preprocessing pipeline for NLP tasks.
Sample Answer: Text preprocessing prepares raw text for NLP tasks, improving data consistency and quality. The following pipeline covers essential steps, including cleaning, tokenization, stopword removal, lemmatization, and vectorization, making it ready for model training and analysis.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
class TextPreprocessor:
def __init__(self, remove_stopwords=True, lemmatize=True):
self.remove_stopwords = remove_stopwords
self.lemmatize = lemmatize
self.lemmatizer = WordNetLemmatizer()
self.stop_words = set(stopwords.words('english'))
self.vectorizer = TfidfVectorizer()
def clean_text(self, text):
"""Basic text cleaning"""
# Convert to lowercase
text = text.lower()
# Remove special characters
text = re.sub(r'[^a-zA-Z\s]', '', text)
# Remove extra whitespace
text = ' '.join(text.split())
return text
def process_text(self, text):
"""Full text processing pipeline"""
# Clean text
text = self.clean_text(text)
# Tokenize
tokens = word_tokenize(text)
# Remove stopwords
if self.remove_stopwords:
tokens = [t for t in tokens if t not in self.stop_words]
# Lemmatize
if self.lemmatize:
tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
return ' '.join(tokens)
def fit_transform(self, texts):
"""Process texts and convert to TF-IDF vectors"""
processed_texts = [self.process_text(text) for text in texts]
return self.vectorizer.fit_transform(processed_texts)
def transform(self, texts):
"""Transform new texts using fitted vectorizer"""
processed_texts = [self.process_text(text) for text in texts]
return self.vectorizer.transform(processed_texts)
# Example usage
texts = [
"This is a sample text with some numbers 123!",
"Another example of text preprocessing in NLP tasks."
]
preprocessor = TextPreprocessor()
vectors = preprocessor.fit_transform(texts)
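Note that the pipeline assumes the required NLTK resources are already installed; a minimal one-time setup sketch (exact resource names can vary slightly across NLTK versions, e.g. newer releases may also need punkt_tab):
import nltk

nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # English stopword list
nltk.download('wordnet')    # dictionary used by WordNetLemmatizer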
Q26. Implement an anomaly detection system using Isolation Forest and evaluate its performance.
Sample Answer: Isolation Forest is a robust algorithm for anomaly detection. It isolates observations by randomly selecting a feature and a split value; anomalies tend to be isolated in fewer splits, which makes the method efficient and effective on large datasets.
Here’s how to implement an anomaly detection system with Isolation Forest and evaluate its performance:
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score
import numpy as np
class AnomalyDetector:
def __init__(self, contamination=0.1, random_state=42):
"""
Initialize Isolation Forest model.
- contamination: expected proportion of anomalies in the data.
- random_state: seed for reproducibility.
"""
self.model = IsolationForest(
contamination=contamination,
random_state=random_state
)
def fit_predict(self, X):
"""
Fit the model to the data and predict anomalies.
Returns predictions (1 for normal, -1 for anomaly), anomaly scores,
and indices of detected anomalies.
"""
predictions = self.model.fit_predict(X)
scores = self.model.score_samples(X)
return {
'predictions': predictions,
'scores': scores,
'anomaly_indices': np.where(predictions == -1)[0]
}
def evaluate(self, y_true, y_pred):
"""
Evaluate the model’s performance using precision and recall.
Convert Isolation Forest predictions to binary labels (0: normal, 1: anomaly).
"""
y_pred_binary = np.where(y_pred == -1, 1, 0)
return {
'precision': precision_score(y_true, y_pred_binary),
'recall': recall_score(y_true, y_pred_binary)
}
# Example usage
# Generate synthetic data with anomalies
np.random.seed(42)
normal_points = np.random.normal(0, 1, (100, 2))
anomaly_points = np.random.normal(5, 1, (10, 2))
X = np.vstack([normal_points, anomaly_points])
# Create true labels (0: normal, 1: anomaly)
y_true = np.zeros(110)
y_true[100:] = 1
detector = AnomalyDetector()
results = detector.fit_predict(X)
metrics = detector.evaluate(y_true, results['predictions'])
print("Anomaly Detection Metrics:")
print("Precision:", metrics['precision'])
print("Recall:", metrics['recall'])
Q27. Can you write a recursive function to find the nth Fibonacci number?
Sample Answer: The Fibonacci sequence consists of numbers where each number is the sum of the two preceding ones, typically starting with 0 and 1. Here is how to implement a recursive function to find the nth Fibonacci number:
def fibonacci(n):
if n <= 0:
return "Invalid input"
elif n == 1:
return 0
elif n == 2:
return 1
else:
return fibonacci(n - 1) + fibonacci(n - 2)
# Example usage
print(fibonacci(10)) # Output: 34
Q28. Could you explain time complexity and space complexity?
Sample Answer: Time complexity and space complexity are fundamental metrics for assessing an algorithm’s efficiency, especially as input sizes grow.
- Time Complexity: Time complexity indicates how the runtime of an algorithm increases as the input size (n) grows. It’s expressed in Big O notation, which gives an upper bound on the time an algorithm might take to execute. For instance, a linear search has a time complexity of O(n), meaning its running time scales linearly with the input size.
# Example of O(n) time complexity
def linear_search(arr, target):
for i in range(len(arr)):
if arr[i] == target:
return i
return -1
- Space Complexity: This measures how much memory an algorithm uses as a function of input size, also expressed in Big O notation. For example, an algorithm that uses a constant amount of extra memory has a space complexity of O(1).
# Example of O(1) space complexity
def example_function(arr):
total = 0
for i in arr:
total += i
return total
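For contrast, an algorithm that builds a new structure proportional to its input has O(n) space complexity, as in this small sketch:
# Example of O(n) space complexity: the result list grows with the input size
def squares(arr):
    result = []
    for x in arr:
        result.append(x * x)
    return result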
Q29. How would you read a CSV file into a DataFrame using Pandas?
Sample Answer: Reading a CSV file into a Pandas DataFrame is simple and efficient. The read_csv() function loads the CSV data into a structured format ideal for data manipulation and analysis.
import pandas as pd
# Reading a CSV file into a DataFrame
df = pd.read_csv('path_to_file.csv')
# Displaying the first few rows of the DataFrame
print(df.head())
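read_csv also accepts many optional parameters; a sketch of commonly used ones (the column names here are hypothetical):
df = pd.read_csv(
    'path_to_file.csv',
    sep=',',                    # column delimiter (use '\t' for tab-separated files)
    usecols=['date', 'price'],  # hypothetical column names: load only what you need
    parse_dates=['date'],       # parse the date column as datetime
    nrows=1000                  # read only the first 1,000 rows
)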
Q30. What distinguishes loc from iloc in Pandas?
Sample Answer: The key difference between loc and iloc lies in how they access and select data within a DataFrame:
- loc (Label-based Indexing): loc allows you to select data using labels or boolean arrays. This means you reference the labels of rows and columns (which could be strings, numbers, or other types), making it ideal when you know the specific labels you want to select.
# Select rows from index 0 to 5 and specific columns by label
df.loc[0:5, ['column1', 'column2']]
- iloc (Position-based Indexing): iloc selects data based on integer positions (row and column indices). It’s helpful when you are not concerned with labels but instead want to access data based on numerical positions (like slicing an array).
# Select rows and columns by integer position
df.iloc[0:5, [0, 1]]
Q31. How do you manage missing values in a DataFrame?
Sample Answer: Handling missing values is crucial for effective data analysis. Pandas offers several methods to deal with missing data:
- Detecting Missing Values: df.isnull() identifies missing values by returning a DataFrame of Boolean values, where True indicates missing data.
# Detect missing values
missing_values = df.isnull()
- Dropping Missing Values: df.dropna() removes rows or columns with missing values, depending on the specified axis (axis=0 for rows and axis=1 for columns).
# Drop rows with missing values
df_cleaned = df.dropna()
# Drop columns with missing values
df_cleaned = df.dropna(axis=1)
- Filling Missing Values: df.fillna(0) replaces missing values with a specified value (e.g., 0). df.fillna(df.mean()) fills missing values with the mean of the respective column.
# Fill missing values with a specific value
df_filled = df.fillna(0)
# Fill missing values with the mean of the column
df_filled = df.fillna(df.mean())
Q32. How would you merge two DataFrames in Pandas?
Sample Answer: To merge two DataFrames, I would use the merge function, which operates similarly to SQL joins. Here is how to merge two DataFrames in Pandas:
# Creating two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
# Merging DataFrames on the 'key' column
merged_df = pd.merge(df1, df2, on='key', how='inner')
# Displaying the merged DataFrame
print(merged_df)
In this case, how='inner' specifies an inner join. Other options include 'left', 'right', and 'outer' for different types of joins, as shown in the sketch below.
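For instance, an outer join keeps keys from both DataFrames and fills the gaps with NaN; a quick sketch using the same df1 and df2:
# Outer join: keeps keys from both DataFrames, filling gaps with NaN
outer_df = pd.merge(df1, df2, on='key', how='outer')
print(outer_df)  # Rows for C and D appear with NaN in the non-matching value column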
Q33. How do you create a NumPy array?
Sample Answer: Creating a NumPy array is quite simple. You can utilize the array function from the NumPy library. Here’s how:
import numpy as np
# Creating a NumPy array from a list
my_array = np.array([1, 2, 3, 4, 5])
# Displaying the array
print(my_array)
This code converts a Python list into a NumPy array. Additionally, you can create arrays with specific shapes and values using functions like np.zeros, np.ones, and np.arange.
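A brief sketch of those helpers:
zeros = np.zeros((2, 3))        # 2x3 array filled with 0.0
ones = np.ones(4)               # array([1., 1., 1., 1.])
sequence = np.arange(0, 10, 2)  # array([0, 2, 4, 6, 8])
print(zeros.shape, ones, sequence)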
Q34. Can you explain broadcasting in NumPy? Provide an example.
Sample Answer: Broadcasting is a powerful feature in NumPy that allows operations on arrays of different shapes without needing to create copies of data. NumPy automatically expands smaller arrays to match larger ones during operations. Here’s an example:
import numpy as np
# Creating a 1D array
arr1 = np.array([10, 20, 30])
# Creating a 2D array
arr2 = np.array([[1], [2], [3]])
# Broadcasting arr1 across arr2
output = arr1 * arr2
# Displaying the output
print(output)
Output:
[[ 10 20 30]
[ 20 40 60]
[ 30 60 90]]
In this case, arr1 (shape (3,)) and arr2 (shape (3, 1)) are both broadcast to a common (3, 3) shape, and the multiplication is performed element-wise. Broadcasting eliminates the need for reshaping or looping, making code cleaner and more efficient.
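Broadcasting also covers simpler cases, such as combining an array with a scalar or adding a 1D row vector to each row of a 2D array; a small sketch:
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix + 10)                         # Scalar broadcast: 10 is added to every element
print(matrix + np.array([100, 200, 300]))  # A 1D row vector is broadcast across both rows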
Q35. How do you transpose a NumPy array?
Sample Answer: Transposing an array involves swapping its rows and columns. You can achieve this using the transpose() method or the .T attribute. Here’s how:
import numpy as np
# Creating a 2D array
array = np.array([[1, 2, 3], [4, 5, 6]])
# Transposing the array
transposed_array = array.T
# Displaying the transposed array
print(transposed_array)
Output:
[[1 4]
[2 5]
[3 6]]
This operation is especially useful in linear algebra and data manipulation contexts.
Q36. How do you perform matrix multiplication using NumPy?
Sample Answer: Matrix multiplication in NumPy can be performed with either the dot() function or the @ operator. Here is how you can perform matrix multiplication using NumPy:
import numpy as np
# Creating two matrices
mat1 = np.array([[2, 3], [4, 5]])
mat2 = np.array([[6, 7], [8, 9]])
# Matrix multiplication using the dot function
result = np.dot(mat1, mat2)
# Alternatively, using the @ operator
result_alt = mat1 @ mat2
# Displaying the result
print(result)
Output:
[[36 41]
[64 73]]
In this case, each entry of the result is the dot product of a row from mat1 with a column from mat2. This operation is fundamental in areas such as linear algebra and machine learning.
Conclusion
To ace your data science interview, you need a solid grasp of both fundamental principles and advanced concepts. Practice these data science coding interview questions to highlight your technical understanding, work experience, and relevant skills. Moreover, if you are a skilled data science professional, check out our guide on the highest-paying data science jobs in India to find out the top salaries you can expect and the skills required for those roles. Additionally, explore our data science course with placement assistance; it offers industry-relevant job training, and its curriculum includes modules that will help you acquire the necessary technical skills.