Datenanalyse- Verkaufszahlen
Verkaufszahlen Analyse – Python
kaggle data source:
https://www.kaggle.com/datasets/sinjoysaha/sales-analysis-dataset
Content
This is a transactions data from an Electronics store chain in the US. The data contains 12 CSV files for each month of 2019.
The naming convention is as follows: Sales_[MONTH_NAME]_2019
Each file contains anywhere from around 9000
to 26000
rows and 6
columns. The columns are as follows:Order ID
, Product
, Quantity Ordered
, Price Each
, Order Date
, Purchase Address
There are around 186851
data points combining all the 12-month files. There may be null values in some rows.
In [ ]:
import pandas as pd
import os
import matplotlib.pyplot as plt
Zusammenführen von 12 Monaten Verkaufsdaten in einer einzigen CSV-Datei.
In [ ]:
files = [file for file in os.listdir("D:/py/sales_data/input_csvs/")]
all_months_data = pd.DataFrame()
for file in files:
df = pd.read_csv("D:/py/sales_data/input_csvs/"+file)
all_months_data = pd.concat([all_months_data, df])
all_months_data.to_csv("D:/py/sales_data/all_data.csv", index=False)
Einlesen der Datei und anschauen der ersten Zeilen
In [ ]:
all_data=pd.read_csv("D:/py/sales_data/all_data.csv")
all_data.head()
Out[ ]:
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | |
---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
In [ ]:
all_data=all_data.dropna(how='all')
all_data.head()
Out[ ]:
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | |
---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 |
In [ ]:
#Removing rows with "Or" in "Order Date" column ?
all_data = all_data[all_data['Order Date'].str[0:2]!='Or']
all_data['Month'] = all_data['Order Date'].str[0:2]
all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
Out[ ]:
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | |
---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 |
In [ ]:
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])
all_data['Sales'] = all_data['Quantity Ordered']*all_data['Price Each']
all_data.head()
Out[ ]:
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | |
---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 |
3 | 176560 | Google Phone | 1 | 600.00 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 |
Welcher Monat verzeichnete die höchsten Verkaufszahlen und wie hoch war der Umsatz in diesem Monat?"¶
In [ ]:
all_data.groupby('Month').sum()
all_data.groupby('Month').sum().to_csv("D:/py/sales_data/all_data_month.csv", index=False)
C:\Users\gov\AppData\Local\Temp\ipykernel_3096\1744586910.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. all_data.groupby('Month').sum() C:\Users\gov\AppData\Local\Temp\ipykernel_3096\1744586910.py:3: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. all_data.groupby('Month').sum().to_csv("D:/py/sales_data/all_data_month.csv", index=False)
In [ ]:
months = range(1,13)
results = all_data.groupby('Month').sum()
plt.bar(months, results['Sales'])
plt.xticks(months)
labels, location = plt.yticks()
plt.yticks(labels, (labels).astype(int))
plt.ylabel('Verkäufe in USD')
plt.xlabel('Monat')
plt.show()
C:\Users\gov\AppData\Local\Temp\ipykernel_3096\2101255051.py:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. results = all_data.groupby('Month').sum()
Welche Stadt hat die höchsten Verkaufszahlen?¶
In [ ]:
def get_city(address):
return address.split(',')[1]
all_data['City'] = all_data['Purchase Address'].apply(lambda x: get_city(x) + ' ')
all_data.head()
Out[ ]:
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | |
---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 | Dallas |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 | Boston |
3 | 176560 | Google Phone | 1 | 600.00 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles |
In [ ]:
results2 = all_data.groupby('City').sum()
results2
C:\Users\gov\AppData\Local\Temp\ipykernel_3096\470953172.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. results2 = all_data.groupby('City').sum()
Out[ ]:
Quantity Ordered | Price Each | Month | Sales | |
---|---|---|---|---|
City | ||||
Atlanta | 33204 | 5559816.40 | 209588 | 5590997.16 |
Austin | 22306 | 3619747.22 | 139658 | 3639163.50 |
Boston | 45056 | 7274819.54 | 282224 | 7323284.02 |
Dallas | 33460 | 5505255.64 | 209240 | 5535950.80 |
Los Angeles | 66578 | 10842870.46 | 416650 | 10905141.60 |
New York City | 55864 | 9270741.66 | 351482 | 9328634.86 |
Portland | 28106 | 4615494.94 | 175530 | 4640981.22 |
San Francisco | 100478 | 16422923.48 | 631040 | 16524407.82 |
Seattle | 33106 | 5466592.02 | 209882 | 5495510.96 |
In [ ]:
cities = [city for city, df in all_data.groupby('City')]
plt.bar(cities, results2['Sales'])
plt.xticks(cities, rotation='vertical', size = 8)
labels, location = plt.yticks()
plt.yticks(labels, (labels).astype(int))
plt.ylabel('Verkäufe in USD')
plt.xlabel('Stadt')
plt.show()
Zu welcher Zeit sollten wir Anzeigen schalten, um die Wahrscheinlichkeit zu maximieren, dass Kunden ein Produkt kaufen?¶
In [ ]:
all_data['Order_Date_DTO'] = pd.to_datetime(all_data['Order Date'])
all_data['Hour'] = all_data['Order_Date_DTO'].dt.hour
all_data.head()
Out[ ]:
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | Month | Sales | City | Order_Date_DTO | Hour | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 4 | 23.90 | Dallas | 2019-04-19 08:46:00 | 8 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 4 | 99.99 | Boston | 2019-04-07 22:30:00 | 22 |
3 | 176560 | Google Phone | 1 | 600.00 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 600.00 | Los Angeles | 2019-04-12 14:38:00 | 14 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles | 2019-04-12 14:38:00 | 14 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 | 11.99 | Los Angeles | 2019-04-30 09:27:00 | 9 |
In [ ]:
results3 = all_data.groupby(['Hour'])['Quantity Ordered'].count()
hours = [hour for hour, df in all_data.groupby('Hour')]
plt.plot(hours, results3)
plt.xticks(hours)
plt.xlabel('Stunde')
plt.ylabel('Bestellungen Anzahl')
plt.grid()
plt.show()
Welcher Bundle Artikel würde sich empfehlen?¶
In [ ]:
new_all = all_data[all_data['Order ID'].duplicated(keep=False)]
new_all['Product_Bundle'] = new_all.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
new_all = new_all[['Order ID','Product_Bundle']].drop_duplicates()
new_all.head()
Out[ ]:
Order ID | Product_Bundle | |
---|---|---|
0 | 176558 | USB-C Charging Cable,USB-C Charging Cable |
2 | 176559 | Bose SoundSport Headphones,Bose SoundSport Hea... |
3 | 176560 | Google Phone,Wired Headphones,Google Phone,Wir... |
5 | 176561 | Wired Headphones,Wired Headphones |
6 | 176562 | USB-C Charging Cable,USB-C Charging Cable |
Welches Produkt wurde am meisten verkauft?¶
In [ ]:
product_group = all_data.groupby('Product')
quantity_ordered = product_group.sum()['Quantity Ordered']
products = [product for product, df in product_group]
plt.bar(products, quantity_ordered, color='b')
plt.ylabel('Bestellaufkommen', color='b')
plt.xlabel('Produktname')
plt.xticks(products, rotation='vertical', size=8)
plt.show()
C:\Users\gov\AppData\Local\Temp\ipykernel_3096\3496259732.py:4: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. quantity_ordered = product_group.sum()['Quantity Ordered']
Gegenüberstellung: Preise der Produkte und Bestellaufkommen¶
In [ ]:
prices = all_data.groupby('Product').mean()['Price Each']
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.bar(products, quantity_ordered, color='b')
ax2.plot(products, prices, color='r')
ax1.set_xlabel('Produktname')
ax1.set_ylabel('Bestellaufkommen', color='b')
ax2.set_ylabel('Preis in USD', color = 'r')
ax1.set_xticklabels(products, rotation='vertical', size=8)
plt.show()
C:\Users\gov\AppData\Local\Temp\ipykernel_3096\4049133573.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function. prices = all_data.groupby('Product').mean()['Price Each'] C:\Users\gov\AppData\Local\Temp\ipykernel_3096\4049133573.py:10: UserWarning: FixedFormatter should only be used together with FixedLocator ax1.set_xticklabels(products, rotation='vertical', size=8)