This assignment helps you practice processing and plotting different datasets using Python.
Question 1:
In this question, you will use the "Titanic" dataset. This dataset contains information
on 891 passengers, with the following columns:
Column     Description
sex        Gender (“male” or “female”)
age        Age in years (float; some missing)
sibsp      Number of siblings/spouses aboard
parch      Number of parents/children aboard
fare       Ticket fare in British pounds
embarked   Port of embarkation (“C” = Cherbourg; “Q” = Queenstown; “S” = Southampton)
class      Passenger class (“First”, “Second”, or “Third”)
alone      Was the passenger travelling alone (True or False)
survived   Survival indicator (0 = No, 1 = Yes)
Write a Python program to perform the following tasks/answer the following questions:
1. Load the dataset into a Pandas DataFrame and display the first five rows.
2. Plot a pie chart showing the percentage of passengers who embarked at each port.
3. Create a bar plot showing the number of passengers travelling in each class.
4. Create a scatter plot of age vs. fare, colored by class. Based on this plot, state which
passenger class generally paid the highest fares.
5. Calculate and print the mean and five-number summary of fare using the describe() method.
6. Create a box plot of fare and identify any outliers.
7. Plot a histogram of age and describe its distribution shape.
8. Filter the data to female passengers and create a bar plot of their class counts.
9. Filter the data to male passengers and create a bar plot of their class counts. Compare the two
distributions.
10. Generate a heatmap of correlations among the numeric features of the data frame (age, fare,
sibsp, parch); identify the strongest and weakest correlations.
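The kinds of calls these tasks rely on can be sketched on a small made-up table. The block below is only an illustration under stated assumptions: the rows in `toy` are invented (they are not Titanic data, so they say nothing about the answers), the variable names are hypothetical, and the correlation heatmap is drawn with plain Matplotlib `imshow` rather than any particular plotting library the course may expect.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; remove for pop-up windows
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy rows, NOT real Titanic data; only the column names follow the table above.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "age": [22.0, 38.0, 26.0, None, 35.0, 54.0],
    "sibsp": [1, 1, 0, 0, 1, 0],
    "parch": [0, 0, 0, 0, 2, 1],
    "fare": [7.25, 71.28, 7.93, 8.46, 53.10, 51.86],
    "embarked": ["S", "C", "S", "Q", "S", "S"],
    "class": ["Third", "First", "Third", "Third", "First", "First"],
})

print(toy.head())  # first five rows

# Category counts feed pie charts and bar plots.
port_counts = toy["embarked"].value_counts()
class_counts = toy["class"].value_counts()

# describe() reports count, mean, std, min, quartiles, and max.
fare_summary = toy["fare"].describe()
print(fare_summary)

# Boolean filtering keeps only the rows that satisfy a condition.
female_class_counts = toy[toy["sex"] == "female"]["class"].value_counts()

# Pairwise correlations among the numeric columns.
corr = toy[["age", "fare", "sibsp", "parch"]].corr()

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
port_counts.plot.pie(ax=axes[0, 0], autopct="%1.1f%%")   # percentages per port
class_counts.plot.bar(ax=axes[0, 1], rot=0)               # passengers per class
toy.boxplot(column="fare", ax=axes[0, 2])                 # points beyond the whiskers are outliers
toy["age"].dropna().plot.hist(ax=axes[1, 0], bins=5)      # distribution of age

# Scatter plot with one color per class.
for name, group in toy.groupby("class"):
    axes[1, 1].scatter(group["age"], group["fare"], label=name)
axes[1, 1].set_xlabel("age")
axes[1, 1].set_ylabel("fare")
axes[1, 1].legend()

# Heatmap of the correlation matrix.
im = axes[1, 2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 2].set_xticks(range(len(corr.columns)))
axes[1, 2].set_xticklabels(corr.columns, rotation=45)
axes[1, 2].set_yticks(range(len(corr.index)))
axes[1, 2].set_yticklabels(corr.index)
fig.colorbar(im, ax=axes[1, 2])

fig.tight_layout()
plt.close(fig)  # in an interactive session, call plt.show() instead
```

For the actual assignment, the toy DataFrame would be replaced by the Titanic data loaded into Pandas, and each plot would normally get its own figure with a title and axis labels.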
Question 2:
In this question, you will use the "Penguins" dataset. The dataset contains measurements for
penguin specimens from three islands in Antarctica. Each row represents one penguin. The data
provided include:
Column              Description
species             Penguin species (Adelie, Chinstrap, Gentoo)
Island              Island where the penguin was observed (Torgersen, Biscoe, Dream)
bill_length_mm      Bill length in millimeters (float)
bill_depth_mm       Bill depth in millimeters (float)
flipper_length_mm   Flipper length in millimeters (float)
body_mass_g         Body mass in grams (float)
sex                 Sex of the penguin (male, female)
Write a Python program to perform the following tasks/answer the following questions:
1. Load the dataset from penguins.csv into a Pandas DataFrame and display the first five
rows.
2. Use the info() method to print a description of all columns. How many penguins are
included in this DataFrame?
3. Construct a pie chart showing the proportion of each penguin species.
4. Build a bar plot of penguin counts by island.
5. Calculate and print the mean and five-number summary of bill_length_mm using
the describe() method.
6. Construct a box plot of flipper_length_mm. From the box plot, identify if there are any
outliers.
7. Plot a histogram of body_mass_g. Based on this plot, describe the distribution of the
penguins' body mass.
8. Filter the DataFrame to "Adelie" penguins and plot a histogram of their body_mass_g.
9. Filter the DataFrame to "Gentoo" penguins and plot a histogram of their body_mass_g.
Compare with the histogram in part (8) and state which species tends to be heavier.
10. Generate a heatmap of the correlation matrix among all numeric variables
(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g); from the heatmap,
identify the strongest correlation and the two weakest correlations.
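As a rough illustration of the filtering and correlation-ranking steps, the sketch below works on a few invented rows. Everything in it is an assumption made for the example: the inline CSV text stands in for penguins.csv, its values are made up (not real measurements), and the helper names `pair_strength`, `strongest`, and `weakest_two` are hypothetical. Reading the strongest and weakest correlations directly off the heatmap is equally valid; ranking them in code is just one way to double-check.

```python
import io
from itertools import combinations

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy rows, NOT real penguin measurements; only the column layout mimics the table above.
csv_text = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.0,18.5,181.0,3700.0,male
Adelie,Dream,37.5,17.9,186.0,3400.0,female
Adelie,Biscoe,40.5,19.0,190.0,3900.0,male
Gentoo,Biscoe,46.0,14.0,211.0,4800.0,female
Gentoo,Biscoe,50.0,15.5,222.0,5600.0,male
Gentoo,Biscoe,48.5,15.0,218.0,5200.0,male
"""
# Stand-in for loading the real file with pd.read_csv("penguins.csv").
penguins = pd.read_csv(io.StringIO(csv_text))

penguins.info()          # column names, dtypes, and non-null counts
n_rows = len(penguins)   # number of rows, i.e. number of penguins

bill_summary = penguins["bill_length_mm"].describe()
print(bill_summary)

# Filter by species, then compare the two body-mass histograms on a shared x-axis.
adelie = penguins[penguins["species"] == "Adelie"]
gentoo = penguins[penguins["species"] == "Gentoo"]
fig, (ax_adelie, ax_gentoo) = plt.subplots(1, 2, sharex=True, figsize=(10, 4))
ax_adelie.hist(adelie["body_mass_g"].dropna(), bins=5)
ax_adelie.set_title("Adelie body_mass_g")
ax_gentoo.hist(gentoo["body_mass_g"].dropna(), bins=5)
ax_gentoo.set_title("Gentoo body_mass_g")
plt.close(fig)  # in an interactive session, call plt.show() instead

# Rank every pair of numeric columns by the absolute value of its correlation.
numeric_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
corr = penguins[numeric_cols].corr()
pair_strength = {(a, b): abs(corr.loc[a, b]) for a, b in combinations(numeric_cols, 2)}
strongest = max(pair_strength, key=pair_strength.get)
weakest_two = sorted(pair_strength, key=pair_strength.get)[:2]
print("strongest:", strongest)
print("two weakest:", weakest_two)
```

Ranking by absolute value treats a strong negative correlation as strong; the sign itself is still visible in `corr` and in the heatmap.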