helper function normalize_product_name()
✔ What this function does — line by line
• Takes a product name as input.
• Converts the value to a string.
• Removes any extra spaces at the start or end.
• Checks if the cleaned string is not empty.
• Changes the first character to uppercase.
• Changes all remaining characters to lowercase.
• Returns the normalized version of the product name.
✔ Why I wrote it this way — line by line
• The dataset might contain “coat”, “Coat”, “COAT”, “cOaT”.
• Without normalization, these would be treated as different products.
• That would split counts and make the analysis wrong.
• Normalizing everything to the same format fixes this.
• It ensures consistency when grouping and counting products.
✔ What this function returns — line by line
• A clean, consistently formatted product name.
• Always in “Proper Case”: first letter uppercase, rest lowercase.
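The steps above can be sketched as a small helper. This is a minimal version based on the description; the actual implementation in the assignment may differ in details:

```python
def normalize_product_name(value):
    """Normalize a product name to Proper Case: ' cOaT ' -> 'Coat' (sketch)."""
    text = str(value).strip()               # coerce to string, drop edge spaces
    if not text:                            # empty after stripping: nothing to format
        return text
    return text[0].upper() + text[1:].lower()
```

Python's built-in `str.capitalize()` performs the same transformation for single-word names, so the body could also be written as `str(value).strip().capitalize()`.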
helper function is_numeric()
✔ What this function does — line by line
• Takes any value as input.
• Tries to convert it into a numeric type.
• If conversion works, it decides the value is numeric.
• If conversion fails, it decides the value is not numeric.
✔ Why I wrote it this way — line by line
• Some values that look like numbers may actually contain text or spaces.
• I need a safe way to check values before using them in calculations.
• This avoids runtime errors in later analysis.
• It helps filter out invalid or dirty numeric fields.
✔ What this function returns — line by line
• True when the value can be converted into a number.
• False when the value cannot be converted into a number.
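A minimal sketch of this check, assuming the try/except-around-conversion approach described above:

```python
def is_numeric(value):
    """Return True when value can be converted to a float, False otherwise."""
    try:
        float(str(value).strip())   # ' 42.5 ' passes; 'abc' raises ValueError
        return True
    except (TypeError, ValueError):
        return False
```

The try/except form is the idiomatic Python way to test convertibility: it handles signs, decimals, and scientific notation that simpler checks like `str.isdigit()` would reject.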
helper function find_column_by_keywords()
✔ What this function does — line by line
• Takes a DataFrame and a list of keywords.
• Looks through every column name in the DataFrame.
• Converts each column name to lowercase for comparison.
• Checks if any of the given keywords appear inside that column name.
• Returns the first column name that matches any keyword.
• If no match is found and a fallback index is given, uses the fallback column.
• If no match and no valid fallback, returns nothing.
✔ Why I wrote it this way — line by line
• Real datasets may not always use the exact same column names.
• One file might use “Product”, another “Item Purchased”.
• I want my code to work across multiple datasets without manual changes.
• Using keywords makes the code flexible and reusable.
✔ What this function returns — line by line
• A column name that matches one of the keywords.
• Or a fallback column, if specified and needed.
• Or None, when nothing suitable is found.
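A sketch of this lookup, assuming pandas and the keyword/fallback behaviour described above (the `fallback_index` parameter name is illustrative):

```python
import pandas as pd

def find_column_by_keywords(df, keywords, fallback_index=None):
    """Return the first column whose lowercased name contains any keyword."""
    for col in df.columns:
        name = str(col).lower()
        if any(kw.lower() in name for kw in keywords):
            return col
    # No match: fall back to a positional column if one was given.
    if fallback_index is not None and fallback_index < len(df.columns):
        return df.columns[fallback_index]
    return None

# Example: 'item' matches 'Item Purchased' even though 'Product' is absent.
df = pd.DataFrame(columns=["Customer ID", "Item Purchased", "Shipping Type"])
```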
helper function get_count()
get_count() (and similarly get_percentage())
✔ What this function does — line by line
• Takes a pair (a tuple) as input.
• The tuple usually contains something like (product or shipping type, count or percentage).
• Extracts the second element from that tuple.
• Returns that second element.
✔ Why I wrote it this way — line by line
• When sorting lists of tuples, I need to tell Python what to sort by.
• In this case, I want to sort by the numeric value (count or percentage).
• Using a small helper function keeps the sort logic clean and readable.
• It avoids repeating indexing logic everywhere.
✔ What this function returns — line by line
• The numeric count or percentage that should be used for sorting.
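A sketch of the sort-key helper and how it is used, based on the description above:

```python
def get_count(pair):
    """Sort key: return the numeric second element of a (label, value) tuple."""
    return pair[1]

counts = [("Coat", 12), ("Hat", 30), ("Scarf", 5)]
ranked = sorted(counts, key=get_count, reverse=True)   # most popular first
```

The standard library offers `operator.itemgetter(1)` for the same purpose; a named helper is an equally valid choice when readability is the priority.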
helper function get_percentage()
✔ What this function does — line by line
• Identical in structure to get_count(): takes a (label, value) tuple and returns its second element.
• Here the second element is a shipping-type percentage rather than a raw count.
✔ Why I wrote it this way — line by line
• Sorting lists of (shipping type, percentage) pairs needs the same kind of sort key as sorting counts.
• A named helper keeps the sort call readable.
✔ What this function returns — line by line
• The numeric percentage that should be used for sorting.
one clean_dataset()
✔ What this function does — line by line
• Takes the raw shopping dataset as input.
• Creates a new copy of the dataset so the original is not changed.
• Searches for the product column by looking for words like “product” or “item” in column names.
• Searches for the shipping column by looking for words like “shipping” or “ship” in column names.
• If no product column is found, uses the first column as a fallback.
• If no shipping column is found, uses the second column as a fallback.
• Goes through each row and normalizes the product names using the normalization function.
• Replaces the original product column with these normalized product names.
• Goes through each row and strips extra spaces from the shipping type.
• Replaces the original shipping column with the cleaned shipping values.
• Starts validating rows to remove bad data.
• For each row, converts the product value to lowercase for checking.
• Converts the shipping value to lowercase for checking.
• Checks if product is not empty and not equal to the string “nan”.
• Checks if shipping is not empty and not equal to the string “nan”.
• If both product and shipping values are valid, keeps that row.
• Builds a list of indices of valid rows.
• Filters the DataFrame to keep only these valid rows.
• Returns the cleaned dataset and the identified column names.
✔ Why I wrote it this way — line by line
• I want all later analysis to be based only on valid rows.
• Product and shipping fields must be present and meaningful.
• Normalized product names prevent fragmented counts.
• Clean shipping types prevent mis-grouping.
• Auto-detection of column names makes the code reusable with other datasets.
✔ What this function returns — line by line
• A cleaned DataFrame containing only valid product and shipping records.
• The name of the product column used.
• The name of the shipping column used.
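A condensed sketch of this cleaning flow, assuming pandas. The positional fallback stands in for the keyword search described above, and the inline `.capitalize()` stands in for the separate normalization helper; the real implementation composes the helpers instead:

```python
import pandas as pd

def clean_dataset(df):
    """Keep only rows with a usable product and shipping value (sketch)."""
    df = df.copy()                      # never mutate the caller's DataFrame
    product_col, shipping_col = df.columns[0], df.columns[1]
    df[product_col] = df[product_col].map(lambda v: str(v).strip().capitalize())
    df[shipping_col] = df[shipping_col].map(lambda v: str(v).strip())
    # A row is valid when neither field is empty or the literal string 'nan'.
    valid = df.apply(
        lambda row: str(row[product_col]).strip().lower() not in ("", "nan")
        and str(row[shipping_col]).strip().lower() not in ("", "nan"),
        axis=1,
    )
    return df[valid], product_col, shipping_col

raw = pd.DataFrame({"Item Purchased": ["coat", "  COAT ", "nan", ""],
                    "Shipping Type": ["Express", "Standard", "Express", "Standard"]})
clean, pcol, scol = clean_dataset(raw)
```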
two analyze_products_shipping_types()
✔ What this function does — line by line
• Takes the cleaned dataset and the names of the product and shipping columns.
• Groups the data by product and shipping type together.
• Counts how many times each product–shipping combination appears in the dataset.
• Builds a nested structure that stores, for each product, how many orders each shipping type has.
• For each product, calculates how many total orders that product has across all shipping types.
• For each shipping type of that product, calculates what percentage of that product’s orders used that shipping type.
• Sorts the shipping types for each product from highest percentage to lowest.
• Collects all unique product names into a list.
• Sorts that list of products alphabetically.
• Returns the sorted list of products and the shipping percentages per product.
✔ Why I wrote it this way — line by line
• I wanted to answer: “Which shipping types are most popular for each product?”
• Grouping by both product and shipping shows the relationship between them.
• Percentages are easier to interpret than raw counts.
• Sorting shows the most popular shipping method first.
• Having everything structured by product makes it easy to print and explain.
✔ Why I used groupby — line by line
• Grouping by product and shipping type is exactly what is needed here.
• It automatically counts all combinations in one step.
• It is more efficient and cleaner than writing nested loops.
• It ensures accurate aggregation across the full dataset.
✔ What this function returns — line by line
• A sorted list of all unique products.
• A dictionary mapping each product to a sorted list of shipping types and their percentages.
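The groupby-then-percentage logic can be sketched as follows (column names here are illustrative; the real code receives them from the cleaning step):

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Coat", "Coat", "Coat", "Hat"],
                   "Shipping": ["Express", "Express", "Standard", "Express"]})

# One groupby counts every product–shipping combination at once.
counts = df.groupby(["Product", "Shipping"]).size()

# Convert counts to within-product percentages, sorted high to low.
shipping_pct = {}
for product, sub in counts.groupby(level=0):
    total = sub.sum()
    pairs = [(ship, 100.0 * n / total) for (_, ship), n in sub.items()]
    shipping_pct[product] = sorted(pairs, key=lambda p: p[1], reverse=True)
```

`counts` is a Series with a (Product, Shipping) MultiIndex, so a second groupby over the first index level walks one product at a time.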
three print_product_shipping_analysis()
✔ What this function does — line by line
• Takes the sorted list of products.
• Takes the dictionary with shipping percentages for each product.
• For each product, prints the product name as a heading.
• Under each product, prints each shipping type used.
• Prints the percentage of orders using that shipping type.
• Prints the percentages with two decimal places.
• Maintains the order from most popular to least popular shipping type.
✔ Why I wrote it this way — line by line
• The analysis results should be easy to read and understand.
• Marketing and logistics teams often prefer a simple text summary.
• This format can be easily copied into a report or email.
• Displaying percentages helps interpret customer preferences quickly.
✔ What this function returns — line by line
• It does not return any data.
• Its purpose is to display formatted, human-readable output.
four task1() — The Overall Workflow
✔ What this function does — line by line
• Acts as the main controller for Task 1.
• Takes the raw dataset as input.
• Calls the cleaning function to clean and normalize the dataset.
• Receives the cleaned DataFrame and the identified product and shipping columns.
• Calls the analysis function to compute shipping preferences for each product.
• Receives the list of products and the shipping percentages per product.
• Calls the print function to display the results in a readable format.
• Completes the full flow from raw data to final output.
✔ Why I wrote it this way — line by line
• I wanted one single entry point for Task 1.
• This makes it very easy to run the whole task with one function call.
• It separates responsibilities:
  o Cleaning is done in one place.
  o Analysis is done in another place.
  o Printing is done in another place.
• This structure is easier to maintain, debug, and present.
✔ How the functions work together — line by line
• The main function starts with raw data.
• The cleaning function ensures only good product and shipping values remain.
• The analysis function counts and calculates shipping preferences for each product.
• The printing function displays the final results clearly.
• Together, they transform messy raw data into useful business insights.
✔ What this function returns — line by line
• It does not return data.
• Its role is to coordinate everything and produce final printed output.
1 clean_data_for_segments(df)
✔ What this function does — line by line
• Cleans the raw dataset used for customer segmentation.
• Creates a fresh copy of the dataset so the original is never changed.
• Searches column names for the purchase amount column.
• Searches column names for the previous purchase column.
• Searches column names for the gender column.
• Applies flexible matching so it still finds the right columns even if the names differ.
• Cleans all gender values by stripping extra spaces.
• Converts gender values to a consistent lowercase format.
• Loops through every row to check if purchase amount is numeric.
• Checks if previous purchases is numeric.
• Checks if gender is valid (male or female only).
• Converts valid purchase numbers to floats.
• Keeps only rows that pass all validation checks.
• Returns the cleaned dataset.
• Returns the detected purchase amount column.
• Returns the detected previous purchase column.
• Returns the detected gender column.
✔ Why I wrote it this way — line by line
• Real datasets often have inconsistent column names.
• I needed flexible keyword detection instead of hardcoding column names.
• Gender values often include extra spaces or odd formats.
• Segmentation requires accurate numerical values.
• Removing invalid rows prevents errors later in the analysis.
• Only clean data should be used for computing segment numbers.
✔ What this function returns — line by line
• A cleaned DataFrame containing only valid rows.
• The correct purchase amount column name.
• The correct previous purchase column name.
• The correct gender column name.
2 calculate_segment(total_purchased)
✔ What this function does — line by line
• Takes a customer’s total spending value.
• Checks if the spending amount is negative.
• Rejects invalid negative values.
• Uses a $500 range to determine the customer’s segment.
• Divides the spending by 500.
• Converts the division result into a segment number using integer division.
• Adds 1 so segment numbers start at 1, not 0.
• Ensures the maximum segment is 12.
• Handles the special case where spending equals exactly $6000.
• Returns the final segment number.
✔ Why I wrote it this way — line by line
• Retail segmentation often uses fixed spending brackets.
• $500-wide segments evenly divide customers across spending levels.
• Integer division creates consistent, predictable grouping.
• Restricting segments to a maximum of 12 prevents out-of-range segments.
✔ How segment numbers are calculated — line by line
• Total spending = purchase × previous purchases.
• Divide total spending by 500.
• Floor the value using integer conversion.
• Add 1 to represent the segment.
• Cap the result at 12 if needed.
✔ What this function returns — line by line
• A segment number between 1 and 12.
• Or None if the spending is invalid.
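The bracket arithmetic above can be sketched in a few lines. Note how capping handles the $6000 edge case: `6000 // 500 + 1` would give 13, so the cap pulls it back into segment 12:

```python
def calculate_segment(total_purchased):
    """Map total spending to a segment 1..12 using $500-wide brackets (sketch)."""
    if total_purchased < 0:
        return None                               # reject invalid negative spending
    segment = int(total_purchased // 500) + 1     # $0-499.99 -> 1, $500-999.99 -> 2, ...
    return min(segment, 12)                       # cap: exactly $6000 stays in segment 12
```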
3 create_segment_visualization(df_clean, gender_column)
✔ What this function does — line by line
• Prepares a bar chart showing male and female counts in each segment.
• Creates lists for all 12 segments.
• Uses a grouped count (segment + gender) to count customers.
• Extracts male counts for each segment.
• Extracts female counts for each segment.
• Creates side-by-side bars for male vs female.
• Adds labels to the x-axis showing Segment1 through Segment12.
• Adds y-axis label for number of customers.
• Adds a title for the visualization.
• Adds a legend for male vs female.
• Adds grid lines to make comparisons easier.
• Annotates each bar with the exact number of customers.
• Displays the final chart.
✔ Why I wrote it this way — line by line
• Visual patterns help reveal gender differences in spending behavior.
• Side-by-side bars make male vs female comparisons clear.
• Annotation adds the exact numbers for easier interpretation.
• Fixed segment order ensures a consistent 1–12 layout.
✔ Why groupby is used here — line by line
• It counts customers by segment and gender in one operation.
• It avoids manual nested loops.
• It keeps the code clean and efficient.
• It automatically handles all unique combinations of segment + gender.
✔ What this visualization shows — line by line
• The number of male customers in each spending segment.
• The number of female customers in each spending segment.
• Whether males or females dominate certain segments.
• Which segments have the largest customer populations.
• Potential marketing patterns.
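The counting step that feeds the chart can be sketched as follows (column names are illustrative; the matplotlib `plt.bar` calls that draw the side-by-side bars are omitted so the sketch focuses on the groupby):

```python
import pandas as pd

df = pd.DataFrame({"Segment": [1, 1, 2, 2, 2],
                   "Gender":  ["male", "female", "male", "male", "female"]})

# One groupby produces counts for every segment/gender combination at once.
grouped = df.groupby(["Segment", "Gender"]).size()

# Per-segment male and female counts, defaulting to 0 for empty combinations.
male_counts   = [grouped.get((seg, "male"), 0)   for seg in range(1, 13)]
female_counts = [grouped.get((seg, "female"), 0) for seg in range(1, 13)]
```

These two lists, always of length 12, keep the fixed Segment1–Segment12 layout even when some segments contain no customers.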
4 print_segment_population(df_clean, gender_column)
✔ What this function does — line by line
• Prints the number of males in each segment.
• Prints the number of females in each segment.
• Uses grouped counts to extract values.
• Prints results from Segment 1 to Segment 12.
✔ Why I wrote it this way — line by line
• Business users often want numbers, not just charts.
• This provides a clear, easy-to-read summary.
• It supports the visual chart with exact values.
✔ What this function returns — line by line
• Nothing — it prints the results for easy viewing.
5 task2(df)
✔ What this function does — line by line
• Runs the entire customer segmentation workflow.
• Calls the cleaning function to prepare the dataset.
• Computes total spending for each customer.
• Uses the segmentation function to assign segments.
• Removes customers who don't fit into valid segments.
• Generates the gender-based visualization.
• Prints the segment populations for males and females.
✔ Why I wrote it this way — line by line
• Keeps all steps organized in one controller function.
• Ensures each helper function handles one responsibility.
• Makes the analysis clear and readable.
• Allows the whole process to run automatically from start to finish.
✔ How all functions work together — line by line
• Cleaning prepares accurate data.
• Segment calculation assigns spending groups.
• Visualization shows gender distribution in each segment.
• Printing gives exact numbers for deeper insights.
• Together, they complete a full customer segmentation analysis system.
✔ What this function returns — line by line
• Nothing — it produces the chart and printed results.
1 clean_data_for_task3(df)
✔ What this function does — line by line
• Creates a fresh copy of the dataset so the original file is not modified.
• Searches for the product column by looking for keywords like “product” or “item”.
• Searches for the age column by looking for the word “age”.
• Searches for the previous purchase column by looking for “previous purchase”.
• If these columns are not found, it performs a second, looser search using more flexible keywords.
• Normalizes product names using the normalize_product_name helper function.
• Removes extra spaces from each age value.
• Loops through every row to check if the product field is valid.
• Checks if the age field is valid and not empty.
• Checks if the previous purchase value is numeric.
• Converts previous purchase values to floats for accurate calculations.
• Keeps only the rows that have valid product names, a valid age group, and numeric previous purchases.
• Returns the cleaned dataset and the detected column names.
✔ Why I wrote it this way — line by line
• Product names may appear in inconsistent case formats (Coat / coat / COAT).
• Age groups often include stray spaces or variations in formatting.
• Previous purchase counts must be numeric for accurate calculations.
• Real datasets may have missing or incorrect column names, so flexible detection is required.
• Removing invalid rows ensures the metrics A and B are calculated correctly.
✔ What this function returns — line by line
• A cleaned DataFrame containing only valid rows.
• The detected product column name.
• The detected age column name.
• The detected previous purchase column name.
2 calculate_metrics_for_product(df_product, age_column, previous_purchase_column)
✔ What this function does — line by line
• Takes a subset of the dataset containing only one specific product.
• Groups this subset by age group.
• Calculates the average previous purchases for each age group.
• Stores each age-group average into a list.
• Computes Metric A as the average of these age-group averages.
• Metric A gives equal weight to each age group, even if some age groups have fewer customers.
• Computes Metric B as the overall average previous purchases for the entire product.
• Metric B gives more weight to age groups with more customers.
• Returns both Metric A and Metric B.
✔ Why I wrote it this way — line by line
• Metric A helps understand how the product performs across age groups equally.
• Metric B reflects the real popularity because groups with more customers have more weight.
• Comparing A and B helps detect whether certain age groups dominate the product’s purchases.
• The difference between A and B highlights demographic buying trends.
✔ What this function returns — line by line
• Metric A: average of all age-group averages.
• Metric B: overall average of all previous purchases for the product.
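The distinction between the two metrics is easiest to see with a small worked example (column names are illustrative). Here the "18-25" group has two customers averaging 15 purchases, while the single "26-35" customer has 60:

```python
import pandas as pd

# Rows for a single product: age group and previous-purchase count.
df_product = pd.DataFrame({"Age": ["18-25", "18-25", "26-35"],
                           "Previous": [10.0, 20.0, 60.0]})

# Metric A: mean of the per-age-group means (each group weighted equally).
group_means = df_product.groupby("Age")["Previous"].mean()   # 15.0 and 60.0
metric_A = group_means.mean()                                # (15 + 60) / 2 = 37.5

# Metric B: plain mean over all rows (bigger groups weigh more).
metric_B = df_product["Previous"].mean()                     # (10 + 20 + 60) / 3 = 30.0
```

In this example A > B because the smaller group buys more; A < B arises when the larger groups are the heavier buyers, which is exactly the condition flagged later.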
3 calculate_all_product_metrics(df_clean, product_column, age_column, previous_purchase_column)
✔ What this function does — line by line
• Finds all unique product names in the dataset.
• Stores each product only once.
• Creates two empty dictionaries: one for Metric A values, one for Metric B values.
• Loops through each product.
• Extracts all rows for the current product.
• Calls calculate_metrics_for_product() to compute Metric A and Metric B.
• Stores Metric A in the first dictionary under the product’s name.
• Stores Metric B in the second dictionary under the product’s name.
• Returns the list of unique products and both metric dictionaries.
✔ Why I wrote it this way — line by line
• I wanted to calculate Metrics A and B for every product automatically.
• Using dictionaries keeps the data organized and easy to access later.
• Having all results collected makes the visualization step straightforward.
✔ What this function returns — line by line
• A list of all unique products.
• A dictionary of Metric A values for all products.
• A dictionary of Metric B values for all products.
4 create_product_metrics_visualization(unique_products, metrics_A, metrics_B)
✔ What this function does — line by line
• Receives the list of products and both metric dictionaries.
• Sorts the product names alphabetically for clean display.
• Creates two lists: one for Metric A values and one for Metric B values.
• Generates a grouped bar chart comparing Metric A and Metric B for each product.
• Each product gets two bars: one for Metric A and one for Metric B.
• Labels the x-axis with product names.
• Labels the y-axis with “Average Previous Purchases”.
• Adds a title to describe the chart.
• Rotates x-axis labels for readability.
• Adds a legend to distinguish Metric A from Metric B.
• Adds grid lines to make comparison easier.
• Annotates each bar with its numeric value to show exact values.
• Displays the final visualization.
✔ Why I wrote it this way — line by line
• Comparing Metric A and Metric B visually makes demographic effects easy to see.
• Bar charts are the clearest way to compare values across many products.
• Sorted product names make the chart easier to read.
• Adding the exact values on the bars helps during presentations and reports.
✔ What this visualization shows — line by line
• How evenly a product performs across age groups (Metric A).
• How heavily one or more age groups dominate the product’s purchases (Metric B).
• Whether the product appeals differently across age demographics.
• Which products have strong or weak age-related buying patterns.
5 print_products_where_A_less_than_B(unique_products, metrics_A, metrics_B)
✔ What this function does — line by line
• Loops through every product.
• Compares Metric A and Metric B for that product.
• Checks if Metric A is less than Metric B.
• If yes, adds the product to a list.
• After checking all products, prints only those where A < B.
• If none match, prints a message saying so.
✔ Why I wrote it this way — line by line
• Metric A < Metric B means larger age groups buy more of the product.
• This flags products where specific age groups strongly dominate purchases.
• This helps identify age-biased products for targeted marketing.
✔ What this function returns — line by line
• Nothing — it prints a summary list for easy interpretation.
6 task3(df)
✔ What this function does — line by line
• Runs the full workflow for Task 3.
• Cleans the dataset using clean_data_for_task3().
• Receives the cleaned dataset and column names.
• Calculates Metrics A and B for every product.
• Creates the Metric A vs Metric B visualization.
• Prints the list of products where Metric A < Metric B.
• Completes the entire analysis process from input to final insights.
✔ Why I wrote it this way — line by line
• One function should control the whole task.
• Keeps the flow organized and easy to run.
• Separates cleaning, calculation, visualization, and reporting.
• Makes the code modular and easier to debug.
✔ How the functions work together — line by line
• The cleaning function ensures only valid data moves forward.
• The metric function computes detailed statistics for each product.
• The aggregation function collects results for all products.
• The visualization function compares metrics across products.
• The print function highlights special cases.
• The controller function ties everything together cleanly.
✔ What this function returns — line by line
• Nothing — it produces the chart and printed results.
1 clean_data_for_task4(df)
✔ What this function does — line by line
• Creates a fresh copy of the dataset so the original file is untouched.
• Searches all column names for one that contains the keyword “date”.
• If no date column is found, uses the last column in the dataset as a fallback.
• Loops through every row in the dataset.
• Extracts the date value and removes extra spaces.
• Checks if the date string is not empty and not equal to “nan”.
• Attempts to convert the date string into a date object using the format day.month.year.
• If the date is valid, the row is marked as acceptable.
• If the date is invalid, the row is ignored.
• Collects only the valid rows.
• Returns the cleaned dataset and the detected date column.
✔ Why I wrote it this way — line by line
• Many datasets have inconsistent date formatting.
• Invalid dates would break seasonal and monthly analysis.
• Using fallback column detection makes the function robust with different datasets.
• Removing empty or corrupt dates ensures reliable season calculations later.
• Cleaning the dataset early prevents errors in visualizations.
✔ What this function returns — line by line
• A cleaned DataFrame containing only rows with valid dates.
• The name of the column where the valid dates were found.
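The per-row date check can be sketched with the standard library. `"%d.%m.%Y"` is an assumed format string matching the day.month.year description above (the helper name `is_valid_date` is illustrative):

```python
from datetime import datetime

def is_valid_date(date_str):
    """True when the string parses as day.month.year, e.g. '21.3.2024'."""
    text = str(date_str).strip()
    if not text or text.lower() == "nan":
        return False
    try:
        datetime.strptime(text, "%d.%m.%Y")   # rejects wrong formats AND
        return True                           # impossible dates like 31.2.2024
    except ValueError:
        return False
```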
2 get_season_from_date(date_str)
✔ What this function does — line by line
• Takes a date string as input.
• Attempts to convert the string into a date object.
• Extracts the month from the date.
• Extracts the day from the date.
• Checks if the date falls between March 21 and June 20 → Spring.
• Checks if the date falls between June 21 and September 20 → Summer.
• Checks if the date falls between September 21 and December 20 → Fall.
• If none of the above, assigns the date to Winter.
• Returns the season name as a string.
• If the date is invalid, returns None.
✔ Why I wrote it this way — line by line
• Astronomical (calendar) seasons start on specific dates around the 21st, rather than on the first of the month.
• This approach gives more accurate results than using only months.
• It ensures customers in early March or late December are placed in the correct season.
• The function isolates season detection, improving modularity.
✔ What this function returns — line by line
• One of the season names: “Spring”, “Summer”, “Fall”, or “Winter”.
• None if the date cannot be processed.
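A sketch of this boundary logic, assuming the day.month.year format used elsewhere in the task. Each season test combines a partial start month, full middle months, and a partial end month:

```python
from datetime import datetime

def get_season_from_date(date_str):
    """Map a 'day.month.year' date string to its calendar season (sketch)."""
    try:
        d = datetime.strptime(str(date_str).strip(), "%d.%m.%Y")
    except ValueError:
        return None                           # unparseable date
    month, day = d.month, d.day
    if (month == 3 and day >= 21) or month in (4, 5) or (month == 6 and day <= 20):
        return "Spring"                       # Mar 21 - Jun 20
    if (month == 6 and day >= 21) or month in (7, 8) or (month == 9 and day <= 20):
        return "Summer"                       # Jun 21 - Sep 20
    if (month == 9 and day >= 21) or month in (10, 11) or (month == 12 and day <= 20):
        return "Fall"                         # Sep 21 - Dec 20
    return "Winter"                           # everything else (Dec 21 - Mar 20)
```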
3 visualize_season_popularity(df_clean, date_column)
✔ What this function does — line by line
• Creates an empty dictionary to store season counts.
• Loops through every row in the cleaned dataset.
• Extracts the date value for the row.
• Uses the season-detection function to determine the season.
• Counts how many sales belong to each season.
• Arranges the seasons in a fixed order: Spring, Summer, Fall, Winter.
• Builds lists for the seasons and their corresponding counts.
• Creates a bar chart showing number of sales per season.
• Colors the bars to represent different seasons.
• Labels the x-axis with the season names.
• Labels the y-axis with number of sales.
• Adds a chart title describing seasonal popularity.
• Adds grid lines for readability.
• Adds a number on top of each bar showing exact sales count.
• Displays the final chart.
✔ Why I wrote it this way — line by line
• Seasonal trends are important for marketing and inventory planning.
• A bar chart clearly shows which season has the most sales.
• Using fixed season order keeps charts consistent for every dataset.
• Annotating bars makes the chart presentation-ready.
• The chart helps easily detect high and low sales seasons.
✔ What this visualization shows — line by line
• Which season has the highest number of sales.
• Which season has the lowest number of sales.
• Whether business is seasonal or steady across the year.
• Useful insights for seasonal marketing, stocking, or promotions.
4 get_month_name(month_num)
✔ What this function does — line by line
• Takes a numeric month value between 1 and 12.
• Converts this number into the corresponding month name.
• Uses a predefined list of month names.
• If the number is outside 1–12, returns None.
✔ Why I wrote it this way — line by line
• Date objects give a month number, but charts should show month names.
• A simple conversion keeps visualization clean.
• Avoids repeated month-name logic in multiple locations.
✔ What this function returns — line by line
• The full month name (e.g., “January”, “February”).
• Or None if the input number is invalid.
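A sketch of this conversion. Instead of hand-writing the predefined list, the standard library's `calendar.month_name` sequence can serve the same role (index 1 is "January"):

```python
import calendar

def get_month_name(month_num):
    """Convert 1..12 into 'January'..'December'; None for anything else."""
    if not 1 <= month_num <= 12:
        return None
    return calendar.month_name[month_num]   # stdlib sequence of month names
```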
5 visualize_month_popularity(df_clean, date_column)
✔ What this function does — line by line
• Creates a dictionary to count how many sales occur in each month.
• Loops through every row in the dataset.
• Attempts to convert each date string into a date object.
• Extracts the month number from each valid date.
• Converts the month number into a month name.
• Counts each sale for the corresponding month.
• Sorts months by count to identify the top 3 months.
• Defines a consistent month order from January to December.
• Builds ordered lists of months and their counts.
• Creates a bar chart showing sales per month.
• Labels the x-axis with month names.
• Labels the y-axis with number of sales.
• Adds a title describing month popularity.
• Rotates month names for readability.
• Adds grid lines for clarity.
• Adds numeric annotations above each bar.
• Displays the chart.
• Returns the top 3 months with the highest sales.
✔ Why I wrote it this way — line by line
• Monthly sales trends are useful for detailed planning.
• Some months may be high-season, others low-season.
• Bar charts are the best way to compare 12 categories side-by-side.
• Returning the top 3 months highlights the most important selling periods.
✔ What this visualization shows — line by line
• How sales are distributed across all 12 months.
• Whether the business has monthly spikes or dips.
• Which months have the strongest sales performance.
• Which months could benefit from marketing improvements.
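The counting and top-3 selection can be sketched with the standard library (the `dates` list is illustrative sample data; plotting is omitted):

```python
import calendar
from collections import Counter
from datetime import datetime

dates = ["5.1.2024", "9.1.2024", "2.2.2024", "7.7.2024", "8.7.2024", "9.7.2024"]

month_counts = Counter()
for text in dates:
    try:
        d = datetime.strptime(text.strip(), "%d.%m.%Y")
    except ValueError:
        continue                                  # skip unparseable dates
    month_counts[calendar.month_name[d.month]] += 1

# Counter.most_common sorts by count, highest first.
top3 = [name for name, _ in month_counts.most_common(3)]
```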
6 visualize_product_yearly_sales(df_clean, product_name, product_column, date_column)
✔ What this function does — line by line
• Filters the dataset to include only rows for the selected product.
• Checks if the product exists in the dataset; prints a message if not.
• Creates a dictionary for counting sales per year.
• Loops through every row of the filtered product data.
• Converts each date string into a date object.
• Extracts the year from the date.
• Increments the count for that year.
• Sorts the years in chronological order.
• Creates lists of sorted years and their corresponding counts.
• Generates a bar chart showing product sales by year.
• Adds x-axis labels showing the years.
• Adds y-axis labels showing the number of sales.
• Adds a title referring to that specific product.
• Adds grid lines for readability.
• Adds annotations showing exact yearly sales.
• Displays the chart.
✔ Why I wrote it this way — line by line
• Yearly sales show long-term performance trends.
• Helps detect product growth or decline over time.
• Useful for decisions about continuing, discontinuing, or promoting a product.
• Visual clarity is crucial when comparing many years.
✔ What this visualization shows — line by line
• How the product performed across all available years.
• Whether sales are rising, falling, or stable.
• Which years were peak performance years.
• Whether trends align with seasonal or monthly patterns.
7 task4(df)
✔ What this function does — line by line
• Begins the full Task 4 workflow.
• Cleans the dataset using the date-cleaning function.
• Detects the date column to use for analysis.
• Searches for a column that contains product names.
• If no product column is found, uses the first column as a fallback.
• Calls the season visualization function to display seasonal sales patterns.
• Calls the month visualization function to display monthly sales patterns.
• Stores the top 3 most popular months.
• Prints the top 3 months in ranking order.
• Identifies all unique product names from the dataset.
• Loops through each unique product.
• Calls the yearly-sales visualization for each product.
• Finishes the entire Task 4 analysis end-to-end.
✔ Why I wrote it this way — line by line
• One controller function makes running the task simple.
• Separates the analysis into seasonal, monthly, and yearly perspectives.
• Visualizes both general trends and product-specific trends.
• Ensures all functions are used in a structured, logical order.
• Creates a complete overview of how time affects sales.
✔ How all functions work together — line by line
• Cleaning ensures all date values are valid before analysis.
• Season detection determines which season each sale belongs to.
• Seasonal visualization shows broad seasonal trends.
• Monthly visualization shows month-level patterns and top 3 months.
• Yearly visualization breaks trends down by product and time.
• Together, they create a full time-based analysis pipeline.
✔ What this function returns — line by line
• It does not return data.
• It produces multiple charts and printed summaries.
1 clean_data_for_task5(df)
✔ What this function does — line by line • Creates a fresh copy of the dataset to avoid modifying the original. • Searches for the product column by looking for “product” or “item”. • Searches for the purchase amount column by looking for “purchase amount” or “purchase usd”. • Searches for the previous purchase column by looking for “previous purchase”. • Searches for the gender column by looking for “gender”. • Searches for the age column by looking for “age” or “group”. • If any required column is not found, assigns fallback columns based on position. • Normalizes product names so “coat”, “Coat” and “COAT” are all treated the same. • Strips spaces from gender values and converts them to lowercase. • Cleans age values by removing extra spaces. • Loops through each row to ensure product name is valid. • Checks if gender is either male or female. • Checks if age is not empty or “nan”. • Validates that purchase amount and previous purchases are numeric when they exist. • Keeps only rows with valid product, gender, age, and numeric purchase values. • Calculates Total Purchased USD as purchase amount multiplied by previous purchases. • Assigns each customer to a spending segment using the segment calculation function. • Adds the new “Total Purchased USD” and “Segment” columns to the dataset. • Returns the cleaned dataset along with product, gender, and age column names. ✔ Why I wrote it this way — line by line • Recommendation systems require clean, reliable data. • Product names must be consistent to avoid splitting counts. • Gender and age must be accurate to analyze customer group preferences. • Numeric fields must be validated before computing totals. • Segment calculation requires accurate spending data. • The cleaning step ensures all other functions receive valid input. ✔ What this function returns — line by line • The cleaned DataFrame containing only valid customer rows. • The name of the product column. • The name of the gender column. 
• The name of the age column.
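The cleaning step described above can be sketched as follows. All column names and the segment rule (bands of 500 USD of total spend) are illustrative assumptions, and `is_numeric` is redefined here so the sketch is self-contained.

```python
def is_numeric(value):
    """True if the value can be converted to a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def clean_data_for_task5(df):
    df = df.copy()  # never modify the caller's DataFrame
    # assumed column names; the real code finds these via keyword search
    product_col, amount_col = "Item Purchased", "Purchase Amount (USD)"
    prev_col, gender_col, age_col = "Previous Purchases", "Gender", "Age Group"
    # normalize so "coat", "Coat", "COAT" all become "Coat"
    df[product_col] = df[product_col].astype(str).str.strip().str.capitalize()
    df[gender_col] = df[gender_col].astype(str).str.strip().str.lower()
    df[age_col] = df[age_col].astype(str).str.strip()
    valid = (
        (df[product_col] != "")
        & df[gender_col].isin(["male", "female"])
        & ~df[age_col].isin(["", "nan"])
        & df[amount_col].apply(is_numeric)
        & df[prev_col].apply(is_numeric)
    )
    df = df[valid].copy()
    df["Total Purchased USD"] = df[amount_col].astype(float) * df[prev_col].astype(float)
    # illustrative segment rule: bucket total spend into 500-USD bands
    df["Segment"] = (df["Total Purchased USD"] // 500).astype(int)
    return df, product_col, gender_col, age_col
```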
2 analyze_product_preference_by_segment(df_clean, product_column)
✔ What this function does — line by line • Checks if the dataset has a "Segment" column. • Groups the cleaned dataset by segment and product. • Counts how many times each product appears in each segment. • Stores these counts in a nested dictionary (first level: segment; second level: product; value: count). • For each segment, creates a list of products with their counts. • Sorts the products by how often they were purchased within that segment. • Extracts the top 3 products for each segment. • Stores the top 3 products for every segment in a final dictionary. • Returns this dictionary of recommendations by segment. ✔ Why I wrote it this way — line by line • High spenders and low spenders may prefer different products. • Grouping by segment reveals patterns in spending behavior. • Knowing which products dominate each segment helps with targeted marketing. • The top 3 products per segment simplify the results for real-world use. ✔ Why I used groupby — line by line • It automatically counts all combinations of segments and products. • It eliminates the need for nested loops. • It is more efficient, reliable, and readable. • It organizes data perfectly for recommendation systems. ✔ What this function returns — line by line • A dictionary mapping each segment to its top 3 recommended products.
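A minimal sketch of the groupby-based segment analysis, assuming the cleaned DataFrame carries the "Segment" column described above:

```python
def analyze_product_preference_by_segment(df_clean, product_column):
    if "Segment" not in df_clean.columns:
        return {}
    # groupby counts every (segment, product) pair in one pass, no nested loops
    counts = df_clean.groupby(["Segment", product_column]).size()
    top_by_segment = {}
    for segment in counts.index.get_level_values(0).unique():
        # rank this segment's products by purchase count, keep the top 3
        ranked = counts.loc[segment].sort_values(ascending=False)
        top_by_segment[segment] = list(ranked.head(3).index)
    return top_by_segment
```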
3 analyze_product_preferences_by_age(df_clean, product_column, age_column)
✔ What this function does — line by line • Checks if the age column exists. • Groups the cleaned dataset by age group and product. • Counts how many purchases exist for each age–product pair. • Builds a dictionary where each age group has its own product count list. • For each age group, sorts products by popularity. • Selects the top 3 products purchased by each age group. • Stores these top 3 lists in a dictionary. • Returns the dictionary containing recommendations by age group. ✔ Why I wrote it this way — line by line • Different age groups have different shopping preferences. • Younger ages might prefer modern items; older ages might prefer practical items. • Grouping by age helps identify generational buying patterns. • The top 3 products per age group provide actionable marketing insights. ✔ Why I used groupby — line by line • It organizes the dataset by age group efficiently. • Automatically counts product frequencies for each age group. • Handles all age groups without manual iteration. • Is more accurate than manually counting. ✔ What this function returns — line by line • A dictionary mapping each age group to their top 3 preferred products.
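The age-group analysis follows the same pattern; this sketch uses `value_counts` inside a groupby loop, which already sorts products from most to least purchased:

```python
def analyze_product_preferences_by_age(df_clean, product_column, age_column):
    if age_column not in df_clean.columns:
        return {}
    top_by_age = {}
    for age, group in df_clean.groupby(age_column):
        # value_counts returns products sorted by frequency, descending
        top_by_age[age] = list(group[product_column].value_counts().head(3).index)
    return top_by_age
```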
4 visualize_segment_recommendation(segment_top_products)
✔ What this function does — line by line • Checks if there are segment recommendations available. • Extracts all unique segments that have valid data. • Counts how many segments each product appears in as a top product. • Builds a dictionary of product → number of segments recommending it. • Converts this dictionary into a sortable list. • Sorts the products by how widely they are recommended across segments. • Takes the top 10 products that appear most across segments. • Creates a bar chart showing the number of segments each top product appears in. • Sets labels, a title, and rotates product names for clarity. • Adds grid lines to help interpret the data. • Annotates the bars with the number of segments. • Displays the final chart. ✔ Why I wrote it this way — line by line • Some products appeal to many customer segments. • Products that appear in many segments are strong universal recommendations. • Visualizing this makes it easy to understand product reach. • Retailers can focus on promoting products with the broadest appeal. ✔ What this visualization shows — line by line • Which products are recommended across the highest number of segments. • Which products are only popular in a few segments. • Which items have the strongest universal appeal. • A clear ranking of top recommended items.
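A sketch of the cross-segment reach chart. The counting step is split out so it can be checked on its own; the input shape (segment → list of top products) matches what the segment analysis above is described as returning.

```python
from collections import Counter

import matplotlib.pyplot as plt

def count_segments_per_product(segment_top_products):
    """Return the top 10 (product, number-of-segments) pairs, widest reach first."""
    reach = Counter()
    for products in segment_top_products.values():
        for product in products:
            reach[product] += 1
    return reach.most_common(10)

def plot_segment_reach(segment_top_products):
    ranking = count_segments_per_product(segment_top_products)
    if not ranking:
        return
    names = [product for product, _ in ranking]
    counts = [count for _, count in ranking]
    plt.bar(names, counts)
    plt.xlabel("Product")
    plt.ylabel("Segments recommending it")
    plt.title("Top recommended products across segments")
    plt.xticks(rotation=45, ha="right")  # rotate names for readability
    plt.grid(axis="y", alpha=0.3)
    for i, count in enumerate(counts):
        plt.annotate(str(count), (i, count), ha="center")
    plt.show()
```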
5 print_recommendation_by_segments(segment_top_products)
✔ What this function does — line by line • Checks if there are recommendations for segments. • Sorts the segments numerically. • For each segment, prints a header “Segment X”. • Prints the top 3 recommended products in order. • Displays a clean, easy-to-read list. ✔ Why I wrote it this way — line by line • Marketing teams often prefer printed lists over charts. • Easy to insert into documents, slides, or reports. • Helps teams understand exactly what to recommend to each customer class. ✔ What this function returns — line by line • Nothing — this is a pure presentation/printing function.
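This printing step can be sketched in a few lines (the age-group variant below follows the same shape, with the age label in the header instead of the segment number):

```python
def print_recommendation_by_segments(segment_top_products):
    if not segment_top_products:
        print("No segment recommendations available.")
        return
    for segment in sorted(segment_top_products):  # segments in numeric order
        print(f"Segment {segment}")
        for rank, product in enumerate(segment_top_products[segment], start=1):
            print(f"  {rank}. {product}")
```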
6 print_recommendation_by_age(age_top_products)
✔ What this function does — line by line • Checks if age recommendations exist. • Sorts age groups alphabetically or numerically. • Prints a header like “Age Group 30–40”. • Prints the top 3 recommended products for that age group. • Organizes recommendations clearly and neatly. ✔ Why I wrote it this way — line by line • Retailers need to understand age-based buying patterns. • Lists are easy for managers to read and act on. • Supports age-targeted advertising and promotions. ✔ What this function returns — line by line • Nothing — this function prints human-readable results.
7 Task5(df)
✔ What this function does — line by line • Begins the entire recommendation analysis process. • Cleans the dataset using clean_data_for_task5(). • Ensures only valid customer records enter the analysis pipeline. • Analyzes product preferences for each customer segment. • Analyzes product preferences for each age group. • Creates a visualization of the overall top recommended products across segments. • Prints segment-based recommendations. • Prints age-based recommendations. • Completes the end-to-end recommendation system. ✔ Why I wrote it this way — line by line • A single controller function organizes the entire workflow. • Ensures all necessary steps run in the correct order. • Keeps the project modular, clean, and easy to maintain. • Produces both visual and printed outputs. • Provides clear, actionable business insights. ✔ How all functions work together — line by line • Cleaning prepares high-quality data. • Segment preference analysis finds spending-group favorites. • Age preference analysis finds generational favorites. • Visualization shows product popularity across segments. • Printing functions summarize exact recommendations. • Together, they create a complete recommendation engine. ✔ What this function returns — line by line • Nothing — it triggers all analysis and displays all results.
