Lesson 3 - Lists and For Loops

In this lesson, you’ll learn how to work with data in Python using lists and loops. We’ll practice basic techniques to store information in lists, access data by indexing, slice lists to get specific segments, and use for loops to process large amounts of data quickly. We’ll also explore how to read a data file (in this case, PageGarden.csv) and transform it into a manageable structure for analysis.

What is PageGarden.csv? PageGarden.csv is a fictional dataset representing an online bookstore called “PageGarden.” Each row in this dataset describes a single book. For example, each row might include:

  • The book title (text)
  • The price (a number, e.g., 0.0 if it’s free)
  • The currency (e.g., “USD”)
  • The total number of customer reviews (a large integer)
  • The average user rating (a floating-point number)

A few sample rows from PageGarden.csv might look like this (note: this is sample data, not real):

title,price,currency,total_reviews,avg_rating
"Whispering Leaves",12.99,USD,1050300,4.2
"Berry Tales",0.0,USD,985500,4.5
"Moon Over Pine",7.50,USD,899000,4.6
"Desert Echoes",15.00,USD,1724500,4.4
"Golden Harvest",9.99,USD,990000,4.3
  • The first row is called the “header” row. It describes what each column represents.
  • Each subsequent row describes one book. For example, "Whispering Leaves" has:
    • price = 12.99
    • currency = USD
    • total_reviews = 1050300
    • avg_rating = 4.2

Over the course of these lessons, we’ll learn to read this file, store it in Python as a list of lists, and then perform simple data analysis tasks (like finding the average rating of all books).


Understanding Lists

When working with data, it’s essential to store and organize information in a way that’s easy to manage. One of the most basic and useful data structures in Python is the list.

A list is a collection of items placed in a specific order and enclosed in square brackets [ ]. These items are often called “elements” of the list. Lists are very flexible because they can store elements of many different data types—such as text (strings), numbers (integers and floats), and even other lists.

Why use lists for this data? If we think about one book from our PageGarden.csv dataset, it has several pieces of information:

  • The title (e.g., "Whispering Leaves")
  • The price (e.g., 12.99)
  • The currency (e.g., "USD")
  • The total number of reviews (e.g., 1050300)
  • The average rating (e.g., 4.2)

We could create a separate variable for each piece of information, like this:

title = "Whispering Leaves"
price = 12.99
currency = "USD"
total_reviews = 1050300
avg_rating = 4.2

But imagine we had thousands of books! Creating so many variables would be messy and hard to manage.

Instead, we can use a list to store all these pieces of information about one book together:

book_1 = ["Whispering Leaves", 12.99, "USD", 1050300, 4.2]

Now, book_1 holds all the information for this single book in one place. We don’t have to keep track of multiple variables—just one list. This makes our code shorter, easier to read, and more scalable (meaning it’s easier to expand if we have more books).

Check the type:

print(book_1)
print(type(book_1))
  • print(book_1) will show ['Whispering Leaves', 12.99, 'USD', 1050300, 4.2]. Python displays the strings with single quotes, even though we typed double quotes.
  • print(type(book_1)) will show <class 'list'>, confirming that book_1 is a list.

How to create a list:

  • Write the elements separated by commas.
  • Enclose them in square brackets.

For example:

sample_list = ["Hello", 10, 3.5]

Here:

  • “Hello” (a string),
  • 10 (an integer), and
  • 3.5 (a float)

are all stored together in one list named sample_list.
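As a small illustration of the points above, the sketch below builds a list that mixes several data types, including another list. The name mixed_list and its contents are made up for this example and are not part of PageGarden.csv.

```python
# A made-up list mixing a string, an integer, a float, and another list.
mixed_list = ["Hello", 10, 3.5, ["nested", 1]]

# len() counts the top-level elements; the inner list counts as one element.
print(len(mixed_list))  # 4

# Collect the type name of every element, one at a time.
element_types = []
for element in mixed_list:
    element_types.append(type(element).__name__)

print(element_types)  # ['str', 'int', 'float', 'list']
```

The for loop used here is explained later in the lesson; for now, the point is only that one list can hold values of different types.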

Exercise:

  1. Create a list named book_2 that represents another book from our PageGarden data. Use the following details:

    • Title: "Berry Tales"
    • Price: 0.0
    • Currency: "USD"
    • Total reviews: 985500
    • Average rating: 4.5

    In code:

    book_2 = ["Berry Tales", 0.0, "USD", 985500, 4.5]
  2. Create another list named book_3 for:

    • Title: "Moon Over Pine"
    • Price: 7.50
    • Currency: "USD"
    • Total reviews: 899000
    • Average rating: 4.6

    In code:

    book_3 = ["Moon Over Pine", 7.50, "USD", 899000, 4.6]

These two exercises will help you get comfortable with creating lists. In the next section, we’ll learn how to access elements inside these lists.

Accessing Elements in a List (Indexing)

Now that we know how to create a list, the next step is understanding how to access the information stored inside it. Since lists often contain multiple elements, we need a systematic way to identify and retrieve each piece of data. In Python, this system is called indexing.

What is an index? An index is a numeric position assigned to each element in the list. Indexing in Python starts at 0 for the first element, 1 for the second, and so on. This zero-based indexing is a common feature in many programming languages.

Consider our book_1 list from the previous section:

book_1 = ["Whispering Leaves", 12.99, "USD", 1050300, 4.2]

Let’s write down the index numbers for each element:

  • Index 0 → "Whispering Leaves"
  • Index 1 → 12.99
  • Index 2 → "USD"
  • Index 3 → 1050300
  • Index 4 → 4.2

If we want just the book’s title, "Whispering Leaves", we can access it by using its index:

title = book_1[0]
print(title)  # This will print: Whispering Leaves

To access the average rating (which is at index 4):

rating = book_1[4]
print(rating)  # This will print: 4.2

Why does indexing matter? Indexing lets you pick out exactly the piece of data you need from a list. This is especially useful when you’re dealing with many items. Imagine that PageGarden.csv has thousands of rows—indexing allows you to write generic code to handle each row without having to manually name variables for each piece of data.

Indexing for large datasets: Soon, we’ll load the entire PageGarden.csv file as a list of lists. Each sublist will represent one book’s data. With indexing, we can easily select which part of the data we need, such as extracting all the ratings to find their average.

Important details about indexing:

  • The first element is always at index 0.
  • If you try to access an index that doesn’t exist (for example, book_1[10]), Python will give you an IndexError because the index is out of range.
  • Indexing works the same way on any list variable. For instance, if book_2 is ["Berry Tales", 0.0, "USD", 985500, 4.5], then book_2[3] will give you 985500 (the total number of reviews).
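The out-of-range behavior can be checked directly. The sketch below uses try/except, which this lesson has not introduced, purely so the example keeps running after the error; the variable names last_valid_index and error_raised are chosen for illustration.

```python
book_1 = ["Whispering Leaves", 12.99, "USD", 1050300, 4.2]

# Valid positive indexes run from 0 up to len(book_1) - 1.
last_valid_index = len(book_1) - 1
print(last_valid_index)          # 4
print(book_1[last_valid_index])  # 4.2

# Asking for an index past the end raises an IndexError.
try:
    book_1[10]
    error_raised = False
except IndexError:
    error_raised = True

print(error_raised)  # True
```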

Exercise:

  1. Given the book_2 list:

    book_2 = ["Berry Tales", 0.0, "USD", 985500, 4.5]
    • Extract the total number of reviews and store it in a variable named berry_reviews.
      berry_reviews = book_2[3]
    • Print berry_reviews to confirm it’s 985500.
  2. Do the same for book_3:

    book_3 = ["Moon Over Pine", 7.50, "USD", 899000, 4.6]
    • Extract the average rating and store it in moon_rating.
      moon_rating = book_3[4]
    • Print moon_rating to confirm it’s 4.6.

By practicing indexing, you’ll be better prepared to handle large datasets, where indexing is essential for selecting and using the right parts of your data. In the next section, we’ll explore a related concept called negative indexing, which allows you to access elements from the end of a list.

Negative Indexing

We’ve seen that we can use positive numbers (0, 1, 2, …) to access elements in a list, starting from the front. In Python, there’s another helpful feature called negative indexing, which allows you to access elements from the end of the list backwards.

How does negative indexing work?

  • -1 refers to the last element in the list.
  • -2 refers to the second-to-last element.
  • -3 refers to the third-to-last element, and so forth.

This might seem unusual at first, but it’s a very convenient shortcut when you often need to access the last few elements of a list without knowing its exact length.

Let’s revisit book_1:

book_1 = ["Whispering Leaves", 12.99, "USD", 1050300, 4.2]

Using positive indexing, book_1[4] gives us the average rating (4.2). Using negative indexing:

last_element = book_1[-1]
print(last_element)  # 4.2

book_1[-1] also returns 4.2, the last element of the list. It’s just another way to reach the same data.

Similarly, book_1[-2] would give us the second-to-last element:

second_last = book_1[-2]
print(second_last)  # 1050300 (the total number of reviews)

Why use negative indexing?

  • If you’re always interested in the last element (like the most recent rating or the newest piece of data), negative indexing saves you from having to calculate the length of the list or remember which index number the last element has.
  • For instance, when reading PageGarden.csv, if you want the last piece of data from each row (the average rating), you don’t need to know that it’s at index 4. You can just use [-1] to get it.

Be careful with your indexing!

  • Negative indexing starts at -1. There is no separate “-0”: in Python, -0 equals 0, so book_1[-0] returns the first element, not the last. The last element is always [-1].
  • If you use an index that’s too far negative (like book_1[-10] when the list doesn’t have that many elements), you’ll get an IndexError just as you would with a positive index out of range.
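One way to picture negative indexing is that book_1[-k] refers to the same element as book_1[len(book_1) - k]. The short sketch below checks this for the last two elements and also shows what -0 does; the variable names are chosen for illustration.

```python
book_1 = ["Whispering Leaves", 12.99, "USD", 1050300, 4.2]
length = len(book_1)  # 5

# The last element, reached two different ways.
last_via_negative = book_1[-1]
last_via_length = book_1[length - 1]
print(last_via_negative == last_via_length)  # True

# The second-to-last element, reached two different ways.
second_last_via_negative = book_1[-2]
second_last_via_length = book_1[length - 2]
print(second_last_via_negative == second_last_via_length)  # True

# -0 is simply 0, so this is the FIRST element, not the last.
minus_zero = book_1[-0]
print(minus_zero)  # Whispering Leaves
```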

Exercise:

  1. Using book_2 (["Berry Tales", 0.0, "USD", 985500, 4.5]), retrieve the average rating using negative indexing and store it in berry_rating.

    berry_rating = book_2[-1]
    print(berry_rating)  # Should print 4.5
  2. Using book_3 (["Moon Over Pine", 7.50, "USD", 899000, 4.6]), retrieve the total number of reviews using negative indexing. Hint: the total reviews are the second-to-last element.

    moon_reviews = book_3[-2]
    print(moon_reviews)  # Should print 899000

With these exercises, you’ll become comfortable using negative indexing to quickly access elements at the end of your lists—another handy tool when dealing with data from large datasets. Next, we’ll learn how to access multiple elements at once using a technique called slicing.

Retrieving Multiple Elements (List Slicing)

So far, we’ve focused on accessing single elements from a list using positive and negative indexing. But what if we need to retrieve more than one element at a time? For example, we might want to separate the title, price, and currency of a book from its review counts and ratings. Instead of accessing each element individually, Python lets us use slicing to extract a range of elements at once.

What is slicing? Slicing allows you to specify a start and an end index, and Python will give you a new list containing all the elements between those two indices. The general syntax for slicing is:

a_list[start:end]
  • start is the index where the slice begins (inclusive).
  • end is the index where the slice ends (exclusive, meaning it does not include the element at this index).

For instance, consider our book_1 list again:

book_1 = ["Whispering Leaves", 12.99, "USD", 1050300, 4.2]

The indices for book_1 are:

  • 0 → “Whispering Leaves”
  • 1 → 12.99
  • 2 → “USD”
  • 3 → 1050300
  • 4 → 4.2

If we want the first three elements (title, price, and currency) in a separate list, we can do:

first_three = book_1[0:3]
print(first_three)

This will print:

['Whispering Leaves', 12.99, 'USD']

We used [0:3] to get elements at indices 0, 1, and 2. Notice that the element at index 3 is not included because slicing stops just before the end index.

Slicing shortcuts:

  • If we omit the start index, Python starts from the beginning of the list. For example, book_1[:3] is the same as book_1[0:3].
  • If we omit the end index, Python goes until the end of the list. For example, book_1[2:] would give us all elements starting from index 2 through the end of the list.
  • If we use book_1[:], we get a copy of the entire list.
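The shortcuts above can be tried on book_1 as follows. This is a small sketch; the names from_index_two, last_two, and book_copy are chosen for illustration. It also shows that a slice is a new list, so changing the copy leaves the original untouched.

```python
book_1 = ["Whispering Leaves", 12.99, "USD", 1050300, 4.2]

# Omitting the start index: same as book_1[0:3].
first_three = book_1[:3]
print(first_three)  # ['Whispering Leaves', 12.99, 'USD']

# Omitting the end index: everything from index 2 to the end.
from_index_two = book_1[2:]
print(from_index_two)  # ['USD', 1050300, 4.2]

# Negative indexes work in slices too: the last two elements.
last_two = book_1[-2:]
print(last_two)  # [1050300, 4.2]

# book_1[:] makes a copy; editing the copy does not change book_1.
book_copy = book_1[:]
book_copy[1] = 9.99
print(book_1[1])     # 12.99
print(book_copy[1])  # 9.99
```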

Why slicing is useful: When we start working with the full PageGarden.csv as a list of lists, slicing becomes a powerful way to grab certain segments of our data. For instance, if each row has many columns and we only need the first few for a specific analysis, slicing makes this easy.

Another example: If book_2 = ["Berry Tales", 0.0, "USD", 985500, 4.5] and we want just the price and currency, we know these are at indices 1 and 2:

price_currency = book_2[1:3]
print(price_currency)
# This will print: [0.0, 'USD']

Exercise:

  1. Using book_1, create a slice named metadata_1 that contains only the title, price, and currency.

    metadata_1 = book_1[:3]
    print(metadata_1)  # ['Whispering Leaves', 12.99, 'USD']
  2. Using book_3 = ["Moon Over Pine", 7.50, "USD", 899000, 4.6], create a slice named rating_info that contains only the total number of reviews and the rating.

    rating_info = book_3[3:]
    print(rating_info)  # [899000, 4.6]

By practicing slicing, you’ll be able to quickly and easily extract the exact pieces of data you need. In the next section, we’ll discover how to store many rows of data (like multiple books) in a single list, making it even simpler to manage large datasets.

Storing Multiple Rows of Data (Lists of Lists)

So far, we’ve treated each book’s data as a single list with five elements: title, price, currency, total reviews, and average rating. But what if we have many books? Manually creating a separate variable for each book (like book_1, book_2, book_3, and so on) can quickly become overwhelming.

Instead, we can store all these individual lists inside another list, creating what is commonly known as a list of lists. This structure makes it easy to handle larger datasets, like the entire PageGarden.csv file, which might contain thousands of books.

Why a list of lists?

  • If book_1 represents the first row, book_2 the second, book_3 the third, and so on, we can combine them into a single list called library_data.

  • Each element of library_data would be one of these smaller lists. For example:

    book_1 = ["Whispering Leaves", 12.99, "USD", 1050300, 4.2]
    book_2 = ["Berry Tales", 0.0, "USD", 985500, 4.5]
    book_3 = ["Moon Over Pine", 7.50, "USD", 899000, 4.6]
    
    library_data = [book_1, book_2, book_3]

    Now library_data looks like this:

    [
      ["Whispering Leaves", 12.99, "USD", 1050300, 4.2],
      ["Berry Tales", 0.0, "USD", 985500, 4.5],
      ["Moon Over Pine", 7.50, "USD", 899000, 4.6]
    ]

How do we use indexing here?

  • library_data[0] would give you the entire book_1 list.
  • library_data[1] would give you the entire book_2 list, and so on.

If we want to retrieve the title of the second book ("Berry Tales"), we can do:

second_book = library_data[1]       # ["Berry Tales", 0.0, "USD", 985500, 4.5]
second_title = library_data[1][0]   # "Berry Tales"

Notice how we used two indices here:

  • library_data[1] to select the second list (the second book),
  • [0] to select the first element of that list (the title).

This concept of using multiple indices in a row is called chained indexing, and it’s extremely useful for working with more complex data structures.

Advantages of lists of lists:

  • You can loop through library_data to process every book at once.
  • You can easily select individual pieces of data from any book by combining indexing and slicing.
  • As we’ll see later, this structure is perfect for reading in data from files, like PageGarden.csv, directly into Python.
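To make the first two advantages concrete, here is a minimal sketch that loops over library_data and collects every title, then uses chained indexing to pick one value. The names titles and second_price are chosen for illustration.

```python
library_data = [
    ["Whispering Leaves", 12.99, "USD", 1050300, 4.2],
    ["Berry Tales", 0.0, "USD", 985500, 4.5],
    ["Moon Over Pine", 7.50, "USD", 899000, 4.6],
]

# Each pass of the loop gives us one inner list (one book).
titles = []
for book in library_data:
    titles.append(book[0])  # index 0 is the title

print(titles)  # ['Whispering Leaves', 'Berry Tales', 'Moon Over Pine']

# Chained indexing: second book (index 1), then its price (index 1).
second_price = library_data[1][1]
print(second_price)  # 0.0
```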

Exercise:

  1. Create a library_data list of lists from the three book lists you’ve already defined:

    library_data = [book_1, book_2, book_3]
  2. Print library_data to see how it looks.

  3. Retrieve the average rating of the third book using chained indexing. For book_3 = ["Moon Over Pine", 7.50, "USD", 899000, 4.6], the rating is at index 4.

    third_rating = library_data[2][4]
    print(third_rating)  # Should print 4.6

By organizing data this way, handling more rows becomes much simpler. In the next section, we’ll start learning how to read data directly from a file and transform it into a list of lists, just like we did by hand.

Opening and Reading a Data File

Up to this point, we’ve been manually creating lists for each book. In a real-world scenario, you’ll often receive data in the form of a file rather than typing it out yourself. Python makes it possible to open and read files so that you can transform their contents into lists and then analyze the data.

The PageGarden.csv file: Recall that PageGarden.csv is a file (a text document) containing many rows of data about books. Each row is structured as follows:

title,price,currency,total_reviews,avg_rating

For example:

"Whispering Leaves",12.99,USD,1050300,4.2
"Berry Tales",0.0,USD,985500,4.5
"Moon Over Pine",7.50,USD,899000,4.6
...

The first row is the header row, which describes what each column represents. Every subsequent row describes one book. When we read the file into Python, our goal is to end up with a list of lists—one sublist per book—just like library_data but on a much larger scale.

How to open a file in Python: You can use the built-in open() function to open a file. For example:

opened_file = open("PageGarden.csv")
  • open("PageGarden.csv") looks for a file named “PageGarden.csv” in the current working directory, which is usually the directory (folder) you run your Python code from.
  • The result is something called a “file object,” which we store in opened_file.

Reading the file’s contents: After opening the file, you can read its entire contents into a single string using the read() method:

file_contents = opened_file.read()
  • file_contents will now contain the entire text of PageGarden.csv in one long string. This includes the header row and all the data rows, separated by newline characters (\n).

What does file_contents look like? If you print file_contents[:300], you’ll see the first 300 characters. This helps you verify that you’re reading the file correctly. It might look like this:

title,price,currency,total_reviews,avg_rating
"Whispering Leaves",12.99,USD,1050300,4.2
"Berry Tales",0.0,USD,985500,4.5
...

Closing the file: When you finish reading the file, it’s good practice to close it:

opened_file.close()

This frees up resources on your computer. Even though Python often does this automatically when your program ends, it’s a good habit to close files after you’re done with them.
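Python also offers the with statement, which closes the file for you as soon as the indented block ends. The sketch below is one possible way to use it, not the approach this lesson relies on. So that it runs on its own, it first writes a tiny sample file to a temporary folder; the file name PageGarden_sample.csv and the variables sample_text and sample_path are assumptions made for this example.

```python
import os
import tempfile

# Create a small stand-in file so the example is self-contained.
sample_text = (
    "title,price,currency,total_reviews,avg_rating\n"
    '"Whispering Leaves",12.99,USD,1050300,4.2\n'
    '"Berry Tales",0.0,USD,985500,4.5\n'
)
sample_path = os.path.join(tempfile.mkdtemp(), "PageGarden_sample.csv")
with open(sample_path, "w") as sample_file:
    sample_file.write(sample_text)

# Open, read, and automatically close the file.
with open(sample_path) as opened_file:
    file_contents = opened_file.read()

# After the with block, the file is already closed.
print(opened_file.closed)  # True
print(file_contents[:45])  # the header row
```

With your real data you would pass "PageGarden.csv" to open() instead of sample_path.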

Why read into a string first? Right now, we have all the data in one long string. That’s not very convenient for analysis. In the next sections, we’ll learn how to break this string apart—first into separate rows, and then into separate columns—so that we can build a list of lists. Once we have a list of lists, we can use all the indexing and slicing techniques we’ve practiced to analyze the dataset.

Exercise:

  1. Open the PageGarden.csv file and store the file object in a variable named opened_file.
  2. Read the file’s contents into a variable named file_contents.
  3. Print out the first 200 characters of file_contents to check that it worked.
  4. Close the file.

Example:

opened_file = open("PageGarden.csv")
file_contents = opened_file.read()
print(file_contents[:200])
opened_file.close()

If everything’s working correctly, you’ll see the header row followed by some of the data. In the next section, we’ll learn how to split the string into rows and eventually convert those rows into lists.

From a Single String to Separate Rows

We now have the entire contents of PageGarden.csv stored in a single string. While this is a good start, having all the data in one long string isn’t very convenient for analysis. Our next goal is to break this large string into smaller, more manageable pieces.

Remember the structure of the CSV file:

title,price,currency,total_reviews,avg_rating
"Whispering Leaves",12.99,USD,1050300,4.2
"Berry Tales",0.0,USD,985500,4.5
"Moon Over Pine",7.50,USD,899000,4.6
...

Each line represents a different row of data. The first line is the header row, and each subsequent line describes one book.

Splitting by newline characters: Inside the string, each new line of the file is represented by a special character called a newline (\n). If we “split” the string whenever we encounter a \n, we get a list where each element is one line of the file.

For example:

# Assume file_contents contains the entire CSV data as one string.
rows = file_contents.split("\n")
  • After this step, rows is a list where:
    • rows[0] is the header: "title,price,currency,total_reviews,avg_rating"
    • rows[1] is the first book’s data: '"Whispering Leaves",12.99,USD,1050300,4.2'
    • rows[2] is the second book’s data: '"Berry Tales",0.0,USD,985500,4.5'
    • and so forth.

Why is this helpful? Now that we have each row as a separate string, it’s easier to work with. We can:

  • Inspect individual rows to understand the data better.
  • Further split each row by commas to separate the columns.
  • Remove the header row if we only need to analyze the data.

Check the result: You can print the first few rows to see what they look like:

print(rows[:5])  # This prints the first five rows (including the header)

You should see something like:

[
  'title,price,currency,total_reviews,avg_rating',
  '"Whispering Leaves",12.99,USD,1050300,4.2',
  '"Berry Tales",0.0,USD,985500,4.5',
  '"Moon Over Pine",7.50,USD,899000,4.6',
  ...
]

Dealing with extra newline characters: CSV files often end with a newline and sometimes contain blank lines. Splitting on \n then leaves empty strings in your rows list (a trailing newline produces one empty string at the end). You can filter them out using techniques we’ll learn later, or just be aware that they can appear and handle them gracefully.
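As a preview, one simple way to drop empty strings is to loop over rows and keep only the non-empty ones. This sketch uses an if statement, which the lesson has not covered yet, and a short made-up file_contents string with a blank line in the middle and a newline at the end; the name non_empty_rows is chosen for illustration.

```python
# A made-up stand-in for the real file contents.
file_contents = (
    "title,price,currency,total_reviews,avg_rating\n"
    '"Whispering Leaves",12.99,USD,1050300,4.2\n'
    "\n"
    '"Berry Tales",0.0,USD,985500,4.5\n'
)

rows = file_contents.split("\n")
print(len(rows))  # 5: three real lines plus two empty strings

# Keep only the rows that actually contain text.
non_empty_rows = []
for row in rows:
    if row != "":
        non_empty_rows.append(row)

print(len(non_empty_rows))  # 3
```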

Next steps: Now that we’ve split the data by lines, our next goal is to split each line by commas so that each line becomes a list of individual values. Eventually, we’ll transform this into a list of lists—one list per book—making it easy to perform calculations like average ratings across all books.

Exercise:

  1. Assuming you have file_contents from the previous step, split it into rows by using:
    rows = file_contents.split("\n")
  2. Print the first three rows (rows[:3]) to inspect the data.
  3. Look at rows[0] and identify the headers.
  4. Look at rows[1] and identify which parts represent the title, price, currency, total reviews, and average rating.

This process transforms an unstructured string into a list of rows—bringing us one step closer to a clean, easily analyzable data structure. Next, we’ll tackle splitting each row into individual columns.

Splitting Rows into Columns

We’ve successfully split our large string into individual rows, each of which is still just one long piece of text. Remember, each row of the CSV file follows the same pattern:

title,price,currency,total_reviews,avg_rating
"Whispering Leaves",12.99,USD,1050300,4.2
"Berry Tales",0.0,USD,985500,4.5
"Moon Over Pine",7.50,USD,899000,4.6
...

Within each row, the values are separated by commas. If we can split each row by the comma character (","), we can isolate the individual pieces of data (title, price, currency, total reviews, and avg_rating) and store them in a list. Ultimately, this will give us a list of lists, where each inner list corresponds to one book.

How to split by commas: If rows is a list of strings, where each element is one row of the file, we can select a specific row and split it again:

first_data_row = rows[1]  # For example, "\"Whispering Leaves\",12.99,USD,1050300,4.2"
columns = first_data_row.split(",")
print(columns)

After splitting by the comma, columns might look like this:

['"Whispering Leaves"', '12.99', 'USD', '1050300', '4.2']

Notice that the title is surrounded by quotation marks and appears as '"Whispering Leaves"'. The other values look cleaner. We’ll later learn how to remove these quotes or handle them. For now, the important part is that we have successfully broken the row into separate pieces.

Doing this for all rows: Eventually, we’ll loop over each element in rows (the header included) and split by commas to turn every row into a list of values. This will give us a structure like:

[
  ['title', 'price', 'currency', 'total_reviews', 'avg_rating'],
  ['"Whispering Leaves"', '12.99', 'USD', '1050300', '4.2'],
  ['"Berry Tales"', '0.0', 'USD', '985500', '4.5'],
  ['"Moon Over Pine"', '7.50', 'USD', '899000', '4.6'],
  ...
]

Header vs. Data Rows:

  • The first row (rows[0]) is the header row. It shows column names rather than data about a specific book.
  • From rows[1] onwards, we have data rows representing individual books.

When we start analyzing the data, we may not need the header row. Often, data analysts store the header row separately and then focus on the data rows when performing calculations.

Next steps: Once we have each row split into columns, we can:

  • Convert numeric values (like total_reviews and avg_rating) from strings into integers or floats.
  • Clean up any extra characters (like quotation marks).
  • Perform calculations, such as finding the average rating of all books.

Exercise:

  1. Select a data row, for example row = rows[1].
  2. Split row by commas:
    columns = row.split(",")
    print(columns)
  3. Identify which elements correspond to the title, price, currency, total reviews, and rating.
  4. Print out the columns list to confirm that the data is now separated into individual strings.

By splitting each row into columns, we now have the building blocks we need to create a full list of lists structure, which brings us one step closer to straightforward data analysis. In the next section, we’ll transform all rows, not just one, into lists of columns and combine them into a single data structure.

Converting All Rows into a List of Lists

We now know how to transform one row of CSV data into a list of individual values by splitting on commas. The next step is to apply this process to every row in rows to create a list of lists—a structure where each element is a small list representing one book’s data.

What we have so far:

  • rows: a list of strings, where each string is one line from PageGarden.csv.
  • The first element of rows (rows[0]) is the header: "title,price,currency,total_reviews,avg_rating".
  • Each subsequent element (like rows[1], rows[2], etc.) is a data row, something like: "\"Whispering Leaves\",12.99,USD,1050300,4.2"

Our goal:

  • Turn each row into a list of values by splitting it on ",".
  • Collect all these lists into a single variable, such as data_as_lists. After we do this, we’ll end up with something like:
    [
      ['title', 'price', 'currency', 'total_reviews', 'avg_rating'],
      ['"Whispering Leaves"', '12.99', 'USD', '1050300', '4.2'],
      ['"Berry Tales"', '0.0', 'USD', '985500', '4.5'],
      ['"Moon Over Pine"', '7.50', 'USD', '899000', '4.6'],
      ...
    ]

This structure is much easier to work with because now we can loop over it, access individual elements with indexing, and convert numbers from strings to floats or integers as needed.

How to do this conversion:

  1. Initialize an empty list to store all the rows:

    data_as_lists = []
  2. Loop over each element in rows. For each row in rows:

    • Split the row by commas to get a list of values.
    • Append that list of values to data_as_lists.

    For example:

    data_as_lists = []
    for row in rows:
        columns = row.split(",")
        data_as_lists.append(columns)
  3. After the loop finishes, data_as_lists[0] should be the header row as a list, and data_as_lists[1] should be the first book’s data as a list, etc.

Inspecting the result:

  • Print len(data_as_lists) to see how many rows (including the header) you have.
  • Print data_as_lists[0] to see the header row as a list.
  • Print data_as_lists[1] to see the first book’s data as a list.
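The checks above can be sketched on a small stand-in for rows, so the example runs without the real file. It also counts the columns in every row, a quick sanity check that each line split into the expected five values; the name column_counts is chosen for illustration.

```python
# A short stand-in for the rows list built from the file.
rows = [
    "title,price,currency,total_reviews,avg_rating",
    '"Whispering Leaves",12.99,USD,1050300,4.2',
    '"Berry Tales",0.0,USD,985500,4.5',
]

data_as_lists = []
for row in rows:
    data_as_lists.append(row.split(","))

print(len(data_as_lists))  # 3 (header plus two books)
print(data_as_lists[0])    # the header as a list
print(data_as_lists[1])    # the first book as a list

# Every row should have the same number of columns.
column_counts = []
for columns in data_as_lists:
    column_counts.append(len(columns))

print(column_counts)  # [5, 5, 5]
```

A row with a different count usually points to a blank line or to a title that itself contains a comma.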

Cleaning and conversion: Notice that some of the values may still have quotation marks (like '"Whispering Leaves"') and all numeric values (like '1050300' and '4.2') are still strings. In future steps, we’ll handle this by removing unwanted characters and converting numbers from strings to the appropriate numeric types. For now, focus on verifying that you have a list of lists structure.

Exercise:

  1. Create an empty list called data_as_lists.
  2. Loop through all the rows in rows and split each row into columns.
  3. Append each list of columns to data_as_lists.
  4. Print the first five elements of data_as_lists to confirm that the transformation worked correctly.

By completing this task, you turn a raw CSV file into a structured Python list of lists that’s ready for analysis. In the next sections, we’ll learn how to work with this data more effectively, including cleaning up unwanted characters, converting types, and eventually calculating interesting metrics like the average rating of all books.

Cleaning and Converting Data Types

Now that we have data_as_lists, a list of lists where each inner list represents one row from PageGarden.csv, we need to clean and prepare this data for analysis. Two common tasks at this stage are:

  1. Removing unwanted characters, like extra quotation marks in titles.
  2. Converting numeric values (like total reviews and average rating) from strings to integers or floats, so we can perform calculations.

Why do we need to do this?

  • If we leave the quotation marks in titles, it might be harder to display the data nicely or match it against other information.
  • If we keep numbers as strings, we can’t easily add them up, find their averages, or perform other mathematical operations.

Example: titles with quotes. If the title appears as '"Whispering Leaves"', we may want to remove the extra quotes. There are several ways to clean strings in Python; one simple approach is to use the strip() method to remove leading and trailing characters:

title = '"Whispering Leaves"'
clean_title = title.strip('"')
print(clean_title)  # Whispering Leaves

This removes any " characters from the start and end of the string, leaving you with a cleaner title.
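It is worth seeing that strip() only touches the two ends of the string. In the sketch below, the title with inner quotes is invented for illustration and does not come from PageGarden.csv.

```python
# An invented title that has quotes both around it and inside it.
raw_title = '"The "Best" Garden"'

# strip('"') removes quote characters from the start and end only.
clean_title = raw_title.strip('"')
print(clean_title)  # The "Best" Garden

# Several quote characters in a row at either end are all removed.
double_wrapped = '""Berry Tales""'
print(double_wrapped.strip('"'))  # Berry Tales
```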

Converting numeric values: Right now, values like '1050300' or '4.2' are stored as strings. To work with them as numbers:

  • Use int() for whole numbers (like total reviews).
  • Use float() for decimal numbers (like average rating).

For example:

reviews_str = "1050300"
reviews_int = int(reviews_str)
print(reviews_int)  # 1050300 as an integer

rating_str = "4.2"
rating_float = float(rating_str)
print(rating_float)  # 4.2 as a float, now we can do arithmetic
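The sketch below shows why the conversion matters: adding two numeric strings glues them together instead of summing them. It also shows that int() rejects a string containing a decimal point. The try/except is used here only to keep the example running, and the variable names are chosen for illustration.

```python
reviews_a = "1050300"
reviews_b = "985500"

# With strings, + joins the text end to end.
joined = reviews_a + reviews_b
print(joined)  # 1050300985500

# After converting to integers, + performs real addition.
total = int(reviews_a) + int(reviews_b)
print(total)  # 2035800

# int() cannot parse a decimal string directly; use float() for those.
try:
    int("4.2")
    int_failed = False
except ValueError:
    int_failed = True

print(int_failed)    # True
print(float("4.2"))  # 4.2
```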

Applying this to the dataset:

  • The header row is probably fine as strings since it’s just column names.
  • For each data row, we might do something like:
    row = ['"Whispering Leaves"', '12.99', 'USD', '1050300', '4.2']
    
    # Clean the title
    row[0] = row[0].strip('"')
    
    # Convert price to float
    row[1] = float(row[1])
    
    # Currency is fine as a string (no conversion needed)
    
    # Convert total reviews to int
    row[3] = int(row[3])
    
    # Convert avg_rating to float
    row[4] = float(row[4])
    
    print(row)
    # ['Whispering Leaves', 12.99, 'USD', 1050300, 4.2]

After these conversions, the data is in a much better format for analysis. You can now easily sum up ratings, calculate averages, and sort the data based on numeric values.

Exercise:

  1. Select a row from data_as_lists (other than the header). For example:
    example_row = data_as_lists[1]
  2. Clean the title by removing extra quotes.
  3. Convert the price and avg_rating to floats.
  4. Convert total_reviews to an integer.
  5. Print the cleaned and converted row to confirm the changes.

By cleaning and converting the data, you’ve made it ready for analysis. In the next sections, you’ll learn how to loop over the entire dataset to apply these transformations to every row, and then start performing calculations like finding the average rating across all books.

Applying Transformations to the Entire Dataset

Now that you understand how to clean and convert data for a single row, it’s time to apply these steps to every row in the dataset. This is where the power of loops really shines. Instead of manually cleaning each row, you can write a loop to process all the rows automatically.

What needs to be done for each row?

  • Skip the header row, since it doesn’t contain numerical data.
  • Remove any unwanted quotation marks from the title.
  • Convert the price and average rating from strings to floats.
  • Convert the total number of reviews from a string to an integer.

After this process, every data row in data_as_lists will be in a uniform, clean format, which makes analysis much simpler.

Example:

# Assuming data_as_lists looks like this (simplified):
# [
#   ['title', 'price', 'currency', 'total_reviews', 'avg_rating'],
#   ['"Whispering Leaves"', '12.99', 'USD', '1050300', '4.2'],
#   ['"Berry Tales"', '0.0', 'USD', '985500', '4.5'],
#   ...
# ]

# We start looping from index 1 to skip the header row at index 0.
for i in range(1, len(data_as_lists)):
    row = data_as_lists[i]
    # Clean the title
    row[0] = row[0].strip('"')
    # Convert price to float
    row[1] = float(row[1])
    # Currency stays as is (string)
    # Convert total_reviews to int
    row[3] = int(row[3])
    # Convert avg_rating to float
    row[4] = float(row[4])

After running this loop, every row in data_as_lists (except the header) will be cleaned and converted into the correct data types. Now data_as_lists might look like this:

[
  ['title', 'price', 'currency', 'total_reviews', 'avg_rating'],
  ['Whispering Leaves', 12.99, 'USD', 1050300, 4.2],
  ['Berry Tales', 0.0, 'USD', 985500, 4.5],
  ['Moon Over Pine', 7.5, 'USD', 899000, 4.6],
  ...
]

Why do this for the entire dataset? Having your entire dataset properly formatted means you can now:

  • Calculate statistics, like the average rating for all books.
  • Sort the data by total reviews or rating.
  • Filter the data to find books above or below certain thresholds.

All of these tasks require your data to be in a consistent and numeric-friendly format, which is exactly what this step achieves.
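
To make this concrete, here is a rough sketch of sorting and filtering on a small, already-cleaned sample. It relies on the built-in sorted() function with a key argument and a lambda, which go beyond what this lesson has covered, so treat it as a preview rather than something you need to master now. The names books, books_by_reviews, rating_threshold, and highly_rated_titles are assumptions made for the example.

```python
# A small, already-cleaned sample in the same shape as data_as_lists.
data_as_lists = [
    ['title', 'price', 'currency', 'total_reviews', 'avg_rating'],
    ['Whispering Leaves', 12.99, 'USD', 1050300, 4.2],
    ['Berry Tales', 0.0, 'USD', 985500, 4.5],
    ['Moon Over Pine', 7.5, 'USD', 899000, 4.6],
]

books = data_as_lists[1:]  # every row except the header

# Why conversion matters: strings compare character by character,
# so '985500' counts as "greater" than '1050300' because '9' > '1'.
print('985500' > '1050300')  # True
print(985500 > 1050300)      # False

# Sort by total_reviews (index 3), most-reviewed book first.
books_by_reviews = sorted(books, key=lambda row: row[3], reverse=True)
print(books_by_reviews[0][0])  # Whispering Leaves

# Filter: keep the titles of books rated at or above a threshold.
rating_threshold = 4.5
highly_rated_titles = []
for row in books:
    if row[4] >= rating_threshold:
        highly_rated_titles.append(row[0])

print(highly_rated_titles)  # ['Berry Tales', 'Moon Over Pine']
```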

Exercise:

  1. Write a loop that starts from i = 1 and goes through all the rows of data_as_lists.
  2. For each row:
    • Clean the title by stripping quotes.
    • Convert the price and avg_rating to floats.
    • Convert total_reviews to an int.
  3. After the loop, print a few rows from data_as_lists to ensure the changes took effect.

With this transformation step completed, your dataset is now truly ready for data analysis. In the next sections, we’ll learn how to utilize these clean values to derive meaningful insights, like computing averages and other statistics from the data.

Calculating the Average Rating

Now that your data is clean and each book’s rating is stored as a float, you can start performing calculations. One of the simplest and most common tasks is to find the average rating of all the books in the dataset.

How to calculate an average:

  1. Sum all the ratings.
  2. Divide the sum by the number of ratings.

If you want the average rating of all books:

  • Make sure to skip the header row, since it doesn’t contain ratings.
  • Loop over each data row.
  • Extract the rating from the appropriate index (for our dataset, this should be at index 4).
  • Add it to a running total.
  • After processing all rows, divide the total by the number of rows (excluding the header).

Example:

# data_as_lists is already cleaned and converted
# data_as_lists[0] is the header row
# data_as_lists[1:] is all the book data

rating_sum = 0
count = 0

for i in range(1, len(data_as_lists)):
    row = data_as_lists[i]
    rating = row[4]  # avg_rating is at index 4
    rating_sum += rating
    count += 1

avg_rating = rating_sum / count
print("Average rating:", avg_rating)
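
The same calculation can also be written with a slice and the built-in sum() and len() functions. The version below is one possible variant, shown on the five sample rows from the start of the lesson. It adds a check for an empty dataset, because dividing by zero would raise an error. The list name ratings is an illustrative choice.

```python
# Cleaned sample data; the real data_as_lists would come from the file.
data_as_lists = [
    ['title', 'price', 'currency', 'total_reviews', 'avg_rating'],
    ['Whispering Leaves', 12.99, 'USD', 1050300, 4.2],
    ['Berry Tales', 0.0, 'USD', 985500, 4.5],
    ['Moon Over Pine', 7.5, 'USD', 899000, 4.6],
    ['Desert Echoes', 15.0, 'USD', 1724500, 4.4],
    ['Golden Harvest', 9.99, 'USD', 990000, 4.3],
]

# The slice [1:] skips the header row at index 0.
ratings = []
for row in data_as_lists[1:]:
    ratings.append(row[4])

# Guard against an empty dataset before dividing.
if len(ratings) > 0:
    avg_rating = sum(ratings) / len(ratings)
    # round() hides tiny floating-point noise in the printed result.
    print("Average rating:", round(avg_rating, 2))
else:
    print("No data rows to average.")
```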

What next? With the average rating, you have a measure of how the books in the PageGarden.csv dataset generally perform according to user ratings. You can apply similar techniques to find:

  • The average price of books.
  • The minimum or maximum number of reviews.
  • The most common currency (if the dataset had multiple currencies).
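
As a rough sketch of how these could look, the loop below computes all three in a single pass over the five sample rows. Counting currencies uses a dictionary, a data structure this lesson has not introduced, so that part is only a preview. All variable names here are assumptions made for the example.

```python
# Cleaned sample data; the real data_as_lists would come from the file.
data_as_lists = [
    ['title', 'price', 'currency', 'total_reviews', 'avg_rating'],
    ['Whispering Leaves', 12.99, 'USD', 1050300, 4.2],
    ['Berry Tales', 0.0, 'USD', 985500, 4.5],
    ['Moon Over Pine', 7.5, 'USD', 899000, 4.6],
    ['Desert Echoes', 15.0, 'USD', 1724500, 4.4],
    ['Golden Harvest', 9.99, 'USD', 990000, 4.3],
]

books = data_as_lists[1:]  # skip the header row

price_sum = 0
min_reviews = books[0][3]  # start from the first book's review count
max_reviews = books[0][3]
currency_counts = {}       # maps each currency to how often it appears

for row in books:
    price_sum += row[1]
    if row[3] < min_reviews:
        min_reviews = row[3]
    if row[3] > max_reviews:
        max_reviews = row[3]
    currency = row[2]
    currency_counts[currency] = currency_counts.get(currency, 0) + 1

avg_price = price_sum / len(books)
# Pick the currency whose count is highest.
most_common_currency = max(currency_counts, key=currency_counts.get)

print("Average price:", round(avg_price, 2))
print("Fewest reviews:", min_reviews)
print("Most reviews:", max_reviews)
print("Most common currency:", most_common_currency)
```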

Exercise:

  1. Using your cleaned data_as_lists, calculate the average rating of all books.
  2. Print out the result.

By doing this, you’ve taken raw data from a file, turned it into a manageable structure, cleaned and converted it, and finally performed a calculation that yields a meaningful insight about the dataset. In the next section, we’ll wrap up what we’ve learned and look ahead to more advanced techniques.
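
To see the whole pipeline in one place, here is a compact end-to-end sketch. It makes two assumptions that differ from a real run: a short in-memory string (sample_text) stands in for PageGarden.csv, and Python’s standard csv module does the row splitting, which may not be the approach used earlier in this lesson. Note that csv.reader already removes the quotation marks around titles, so the strip('"') call is kept only to mirror the steps above.

```python
import csv
import io

# A few sample rows standing in for the contents of PageGarden.csv.
sample_text = '''title,price,currency,total_reviews,avg_rating
"Whispering Leaves",12.99,USD,1050300,4.2
"Berry Tales",0.0,USD,985500,4.5
"Moon Over Pine",7.50,USD,899000,4.6
'''

# Step 1: turn the text into a list of lists (every value is a string).
data_as_lists = list(csv.reader(io.StringIO(sample_text)))

# Step 2: clean and convert every row except the header.
for i in range(1, len(data_as_lists)):
    row = data_as_lists[i]
    row[0] = row[0].strip('"')  # no effect here; csv.reader removed the quotes
    row[1] = float(row[1])
    row[3] = int(row[3])
    row[4] = float(row[4])

# Step 3: compute the average rating with a running total.
rating_sum = 0
count = 0
for i in range(1, len(data_as_lists)):
    rating_sum += data_as_lists[i][4]
    count += 1

avg_rating = rating_sum / count
print(data_as_lists[1])
print("Average rating:", round(avg_rating, 2))
```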

Next Steps

In this lesson, you learned:

  • How to store related data points in lists.
  • How to access elements and slices within a list.
  • How to use for loops for repetitive tasks.
  • How to read file data and transform it into lists of lists.
  • How to clean text values and convert strings to numbers with int() and float().
  • How to compute averages from large datasets.

These skills lay the groundwork for more sophisticated data analysis techniques. As you move forward, you’ll discover new data structures, learn about filtering and sorting data, and eventually explore how these fundamentals link to more advanced analytical tasks, including working alongside AI-driven tools.

By mastering lists and loops now, you’re building a strong foundation for all your future data analysis endeavors.