Pandas is a powerful open-source data analysis and manipulation tool built on top of the Python programming language. It provides data structures and functions needed to work with structured data seamlessly and efficiently.
To start using Pandas, you'll need to install it first:
pip install pandas
Figure: a Pandas DataFrame, with data organized in rows and columns.
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.
Here's a quick example to show how Pandas works:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]}
df = pd.DataFrame(data)
print(df)
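To make the "different types in columns" point concrete, here is a minimal sketch that builds a DataFrame with one column per kind of value and prints the dtype Pandas assigns to each. The `people` frame and its `Height` and `Group` columns are made-up illustration data, not part of the example above.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different kind of value
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],        # text
    "Age": [24, 27, 22],                        # integers
    "Height": [1.65, 1.80, 1.75],               # floating point values
    "Group": pd.Categorical(["a", "b", "a"]),   # categorical data
})

# Each column has its own dtype
print(people.dtypes)

# shape is a (rows, columns) tuple
print(people.shape)
```

The exact dtype shown for the text column depends on the Pandas version, so it is best not to rely on it.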
Let's explore some fundamental operations you can perform with Pandas.
Figure: the methods Pandas offers for reading from and writing to different file formats. Whether you're dealing with CSV, Excel, JSON, or SQL databases, Pandas provides functions to import data into your workspace and export your analysis results.
Pandas provides the read_csv() function to read data stored as a CSV file into a DataFrame. It supports many file formats and data sources out of the box (CSV, Excel, SQL, JSON, Parquet, …), each with a reader function named with the prefix read_*:
import pandas as pd
# Reading a CSV file
df = pd.read_csv('filename.csv')
print(df.head()) # Display the first 5 rows
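Reading has a writing counterpart: to_csv() exports a DataFrame as CSV. The sketch below round-trips a small frame through an in-memory buffer so it runs without any file on disk. The buffer and the sample data are assumptions for illustration; with a real file you would pass a path instead.

```python
import io

import pandas as pd

# Hypothetical sample data
df_out = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]})

# Write CSV text to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)  # index=False leaves out the row labels

# Rewind the buffer and read the same CSV text back
buffer.seek(0)
df_in = pd.read_csv(buffer)

print(df_in)
```

Passing index=False matters here: without it, the row labels are written as an extra unnamed column and come back as data when the file is read again.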
It's essential to understand your data. Pandas provides several methods to inspect it:
# Display the first few rows
print(df.head())
# Display summary statistics
print(df.describe())
# Display information about the DataFrame
# (info() prints its report directly and returns None, so no print() is needed)
df.info()
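A few more attributes are often useful at this stage. This sketch uses a small made-up frame (`sample`, with one deliberately missing age) to show the shape, the column names, and a per-column count of missing values.

```python
import pandas as pd

# Hypothetical sample data; None in a numeric column becomes NaN
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24.0, None, 22.0],
})

# (rows, columns)
print(sample.shape)

# Column names as a plain Python list
print(sample.columns.tolist())

# Number of missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```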
You can filter rows based on a condition. This example assumes the DataFrame has an 'Age' column:
# Filter rows where age is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)
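It can help to see what the expression inside the brackets produces on its own. In this sketch, with made-up sample data, the comparison yields a boolean Series (a "mask") with one True/False per row, and indexing with that mask keeps only the True rows.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [24, 27, 22]})

# The comparison is evaluated row by row and returns a boolean Series
mask = df["Age"] > 25
print(mask.tolist())  # [False, True, False]

# Indexing with the mask keeps the rows where it is True
filtered_df = df[mask]
print(filtered_df)
```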
Figure: selecting specific columns from a Pandas DataFrame.
You can select any column by name. For example, to select the 'Age' column:
import pandas as pd
# Example DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# Select the 'Age' column
age_column = df['Age']
print(age_column)
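Selecting one column returns a Series, while passing a list of column names returns a DataFrame. A short sketch of the difference, reusing the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Single brackets with one name: a one-dimensional Series
age_column = df["Age"]
print(type(age_column).__name__)

# A list of names inside the brackets: a DataFrame with just those columns
name_and_city = df[["Name", "City"]]
print(type(name_and_city).__name__)
print(name_and_city)
```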
We can select specific rows and columns from a DataFrame. For example:
import pandas as pd
# Example DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# Select specific rows and columns (e.g., rows 0 and 2, and columns 'Name' and 'Age')
selected_data = df.loc[[0, 2], ['Name', 'Age']]
print(selected_data)
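Note that loc selects by label, which only happens to match the row positions when the index is the default 0, 1, 2. The sketch below gives the rows string labels (an assumption made for illustration) to contrast loc with its position-based counterpart, iloc.

```python
import pandas as pd

# Same sample data, but with explicit string labels as the index
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "City": ["New York", "Los Angeles", "Chicago"],
    },
    index=["a", "b", "c"],
)

# loc: select by row labels and column names
by_label = df.loc[["a", "c"], ["Name", "Age"]]

# iloc: select by integer positions (first and third row, first two columns)
by_position = df.iloc[[0, 2], [0, 1]]

print(by_label)
print(by_label.equals(by_position))  # True
```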
You can filter rows in a DataFrame based on specific conditions. For example, to select rows where the age is greater than 30:
import pandas as pd
# Example DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# Filter rows where Age is greater than 30
filtered_data = df[df['Age'] > 30]
print(filtered_data)
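Conditions can also be combined. A minimal sketch on the same sample data: use & for "and" and | for "or", and wrap each condition in parentheses, because these operators bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Both conditions must hold: older than 25 AND not in Chicago
both = df[(df["Age"] > 25) & (df["City"] != "Chicago")]
print(both["Name"].tolist())  # ['Bob']

# At least one condition must hold: older than 30 OR in New York
either = df[(df["Age"] > 30) | (df["City"] == "New York")]
print(either["Name"].tolist())  # ['Alice', 'Charlie']
```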
Pandas provides built-in plotting functions, backed by Matplotlib (which must be installed), so you can visualize your data with just a few lines of code. For example, to create a simple line plot:
import pandas as pd
import matplotlib.pyplot as plt
# Example DataFrame
data = {
'Year': [2020, 2021, 2022, 2023],
'Sales': [150, 200, 250, 300]
}
df = pd.DataFrame(data)
# Create a line plot
df.plot(x='Year', y='Sales', kind='line', marker='o')
plt.title('Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid()
plt.show()