Programming Across Disciplines

Datasets

Overview
Example Dataset
Publicly Available Datasets

Overview

Datasets are structured collections of data, commonly used in various fields such as science, finance, and information technology for analysis, machine learning, and statistical modeling. Typically organized in a tabular format with rows and columns, datasets can range from simple spreadsheets to complex databases, where each row represents an individual record or observation and each column represents a specific variable or attribute of that record. For example, a dataset for a weather study might include columns for date, temperature, humidity, and wind speed, with each row containing the measurements for a specific day. Datasets can be sourced from various origins, including publicly available datasets on the internet, proprietary data collected by organizations, or generated through simulations.

Concept: Dataset

In programming, a dataset is a collection of data that is typically organized and structured in some manner. It serves as the raw material from which information can be extracted, processed, analyzed, and interpreted. Datasets can come in various forms and sizes, ranging from simple arrays of numbers to complex, multi-dimensional structures containing diverse types of data. They can be sourced from various places like databases, file systems, APIs, or generated through simulations and experiments. In the context of data analysis, machine learning, or scientific research, a dataset is crucial as it forms the basis upon which algorithms operate and insights are drawn. The quality, relevance, and integrity of a dataset greatly influence the outcomes of any computational process applied to it. Good dataset management practices involve cleaning, organizing, and sometimes transforming data to make it suitable for specific tasks. The representation and handling of datasets are often facilitated by programming libraries and tools tailored to the specific needs of the data and the objectives of the project.

The significance of datasets extends beyond data storage: they are used to train algorithms, to uncover insights through analysis, and to inform decisions in many fields.

Many datasets are publicly available, offering a rich resource for programmers and data scientists. These are found on platforms like Kaggle ↗, which hosts datasets ranging from historical election results to the latest trends in video gaming, or the UCI Machine Learning Repository, offering datasets for experimental purposes in machine learning, such as the famous Iris dataset or the Wine Quality dataset. Government websites also provide access to public data, like demographic information and economic indicators.

Python's libraries and tools make it well suited to importing, processing, and analyzing datasets. This section covers the fundamentals of working with datasets in Python: how to access them, manipulate them, and extract useful information from them.

Example Dataset

Let's look at an example dataset and walk through how we might process it. The example dataset below contains the high and low temperatures for each day of November 2023 in Salt Lake City, Utah. Notice that there is a header row that labels each attribute (column). Also, there is one record (row) for each day of the month. Each record contains the date, the high temperature, and the low temperature for that day.

Date,Hi,Lo
2023-11-01,58,32
2023-11-02,64,35
2023-11-03,67,44
2023-11-04,67,41
2023-11-05,65,45
2023-11-06,71,48
2023-11-07,52,38
2023-11-08,49,34
2023-11-09,49,31
2023-11-10,53,31
2023-11-11,54,31
2023-11-12,63,33
2023-11-13,66,42
2023-11-14,66,38
2023-11-15,68,44
2023-11-16,59,41
2023-11-17,58,36
2023-11-18,52,39
2023-11-19,49,35
2023-11-20,46,34
2023-11-21,48,30
2023-11-22,49,30
2023-11-23,42,32
2023-11-24,36,32
2023-11-25,38,29
2023-11-26,36,29
2023-11-27,40,25
2023-11-28,41,24
2023-11-29,36,25
2023-11-30,34,30

Data Details

Data Line 1: This is the header of the data. It contains the names of the columns (like you would see in a spreadsheet or CSV file). Most datasets include this first row to help you identify what each column contains.
Data Lines 2 thru 31: These are the records in this dataset. Note there are 30 records, and each contains a date (Column 1), a high temperature (Column 2), and a low temperature (Column 3).

What kinds of information, summaries, or aggregations could we create using this dataset?

Calculate the average temperature for each day.
Determine the date that had the highest temperature of the month.
Determine the date that had the lowest temperature of the month.
Calculate the average temperature of the entire month.
Create a graph that visually depicts the temperature of all of the days of the month.

Let's walk through a solution that handles three of these tasks: the daily averages, the monthly average, and a simple text-based graph. First, let's set up the data in a Python data structure. There are several options, but for this example I will format the data as a list of lists directly in my code file. In a production system, the data would commonly be stored in a file or a database table rather than in the code itself, but for demonstration it's easier to see everything (data and code) together.

Do you remember how to create a list of lists? Here's the general form we need:

list_of_lists = [[List 1], [List 2], [List 3], ..., [List n]]

In the case of our dataset above, we would form a list from each record (row). So, for example, we would reformat the first record of the dataset:

2023-11-01,58,32

as the following list, with the date as a string and the temperatures as integers:

["2023-11-01", 58, 32]
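As a minimal sketch of that conversion in code, the text of one record can be split on its commas and the two temperatures converted to integers. The variable names used here are illustrative and are not part of the example program that follows.

```python
# Convert one CSV-style line into a list: [date, hi, lo]
line = "2023-11-01,58,32"

# split(",") returns a list of three strings
date, hi_text, lo_text = line.split(",")

# Keep the date as a string; convert the temperatures to integers
record = [date, int(hi_text), int(lo_text)]

print(record)     # ['2023-11-01', 58, 32]
print(record[1])  # 58 (the high temperature is at index 1)
```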

And then if we format all 30 of the records in the dataset as lists, separate each list with a comma, and place the entire set of lists inside the square brackets of an outer list, it would look like this:

# Dataset formatted as a list of lists
data = [
  ["2023-11-01", 58, 32], ["2023-11-02", 64, 35], ["2023-11-03", 67, 44],
  ["2023-11-04", 67, 41], ["2023-11-05", 65, 45], ["2023-11-06", 71, 48],
  ["2023-11-07", 52, 38], ["2023-11-08", 49, 34], ["2023-11-09", 49, 31],
  ["2023-11-10", 53, 31], ["2023-11-11", 54, 31], ["2023-11-12", 63, 33],
  ["2023-11-13", 66, 42], ["2023-11-14", 66, 38], ["2023-11-15", 68, 44],
  ["2023-11-16", 59, 41], ["2023-11-17", 58, 36], ["2023-11-18", 52, 39],
  ["2023-11-19", 49, 35], ["2023-11-20", 46, 34], ["2023-11-21", 48, 30],
  ["2023-11-22", 49, 30], ["2023-11-23", 42, 32], ["2023-11-24", 36, 32],
  ["2023-11-25", 38, 29], ["2023-11-26", 36, 29], ["2023-11-27", 40, 25],
  ["2023-11-28", 41, 24], ["2023-11-29", 36, 25], ["2023-11-30", 34, 30]
]

A list of lists like this will make it reasonably easy to loop through the dataset to process each record.
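For comparison, here is a sketch of how the same kind of list of lists could be built from CSV text instead of being typed by hand. It uses Python's standard csv module. To keep the example self-contained, io.StringIO stands in for a real file and only the first three records are included. The names csv_text and data_from_csv are illustrative.

```python
import csv
import io

# A short sample of the dataset as CSV text (header plus the first three records)
csv_text = """Date,Hi,Lo
2023-11-01,58,32
2023-11-02,64,35
2023-11-03,67,44
"""

data_from_csv = []
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # Read the header row separately from the records
for date, hi, lo in reader:
    # csv.reader yields every field as a string, so convert the temperatures
    data_from_csv.append([date, int(hi), int(lo)])

print(header)
print(data_from_csv)
```

With a file on disk, the io.StringIO(...) call could be replaced by a file object opened with open(path, newline=""), where path is wherever the CSV file is saved.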

Next, I'll write the code to create a few variables to use during processing, loop through the list of lists to calculate the daily average temperatures, append each processed record to a new list (processed_data), and keep a running total temperature and record count that I'll use at the end to calculate the average temperature for the month. The code also prints a table of the original values, followed by the monthly average.

data = [
    ["2023-11-01", 58, 32], ["2023-11-02", 64, 35], ["2023-11-03", 67, 44],
    ["2023-11-04", 67, 41], ["2023-11-05", 65, 45], ["2023-11-06", 71, 48],
    ["2023-11-07", 52, 38], ["2023-11-08", 49, 34], ["2023-11-09", 49, 31],
    ["2023-11-10", 53, 31], ["2023-11-11", 54, 31], ["2023-11-12", 63, 33],
    ["2023-11-13", 66, 42], ["2023-11-14", 66, 38], ["2023-11-15", 68, 44],
    ["2023-11-16", 59, 41], ["2023-11-17", 58, 36], ["2023-11-18", 52, 39],
    ["2023-11-19", 49, 35], ["2023-11-20", 46, 34], ["2023-11-21", 48, 30],
    ["2023-11-22", 49, 30], ["2023-11-23", 42, 32], ["2023-11-24", 36, 32],
    ["2023-11-25", 38, 29], ["2023-11-26", 36, 29], ["2023-11-27", 40, 25],
    ["2023-11-28", 41, 24], ["2023-11-29", 36, 25], ["2023-11-30", 34, 30]
]

processed_data = []
total_temp = 0
count = 0
for record in data:
    date, hi, lo = record
    avg_temp = (hi + lo) / 2
    processed_data.append([date, hi, lo, avg_temp])  # Append as list
    total_temp += avg_temp
    count += 1
print()
print("Salt Lake City\nNovember 2023 Temperature Chart")
print("-" * 40)
print("Date".ljust(15), "High".ljust(10), "Low".ljust(10))
print("-" * 40)
for row in data:
    print(row[0].ljust(15), str(row[1]).ljust(10), str(row[2]).ljust(10))
print("-" * 40)
monthly_avg_temp = total_temp / count
print("Average Temperature: ", "{:.2f}".format(monthly_avg_temp), "Degrees")
print("-" * 40)

Output

Salt Lake City
November 2023 Temperature Chart
----------------------------------------
Date            High       Low
----------------------------------------
2023-11-01      58         32        
2023-11-02      64         35        
2023-11-03      67         44        
2023-11-04      67         41        
2023-11-05      65         45        
2023-11-06      71         48        
2023-11-07      52         38        
2023-11-08      49         34        
2023-11-09      49         31        
2023-11-10      53         31        
2023-11-11      54         31        
2023-11-12      63         33        
2023-11-13      66         42        
2023-11-14      66         38        
2023-11-15      68         44        
2023-11-16      59         41        
2023-11-17      58         36        
2023-11-18      52         39        
2023-11-19      49         35        
2023-11-20      46         34        
2023-11-21      48         30        
2023-11-22      49         30        
2023-11-23      42         32        
2023-11-24      36         32        
2023-11-25      38         29        
2023-11-26      36         29        
2023-11-27      40         25        
2023-11-28      41         24        
2023-11-29      36         25        
2023-11-30      34         30        
----------------------------------------
Average Temperature:  43.57 Degrees
----------------------------------------

Code & Output Details

Code Lines 1 thru 12: This is the list of lists that holds our dataset.
Code Line 14: This is a new list variable we'll use to store our processed records, each containing the original data values and the calculated daily average temperature.
Code Line 15: This variable will store the running total of the daily average temperatures. After looping through our dataset, we'll use it to calculate the average temperature for the month.
Code Line 16: This counter variable will count the number of records in our dataset. We know how many there are in this particular dataset, but we often do not know ahead of time, so counting them in code lets the program handle any number of records.
Code Line 17: Next, we use a for loop to iterate through the dataset one record at a time. In this loop, we calculate each daily average temperature, load our processed_data list with the results, add to our running total, and increment our counter.
Code Line 18: This first line inside the for loop unpacks the record into the three variables date, hi, and lo.
- Note: The same unpacking can be written in the loop header itself as a multi-variable for loop signature (for date, hi, lo in data:). If you need a refresher on this, revisit the for Loop ↗ page.
Code Line 19: This line calculates the average temperature for the record (day). Notice that the calculation (hi + lo) / 2 is assigned to the variable avg_temp.
Code Line 20: This line adds (appends) each record to the processed_data list, copying the original values date, hi, and lo and adding the calculated avg_temp value.
Code Line 21: This line accumulates the average temperature of each day as the loop iterates. We'll use this after the loop to calculate the average temperature for the month.
Code Line 22: This line counts the records in the dataset. We'll use this, along with the total_temp value, after the loop to calculate the average temperature for the month.
Code Lines 23 thru 30: These lines print a title, a column header, and one formatted row for each record in data. Note that this table shows only the original values, not the daily averages stored in processed_data. We'll take a look at displaying those below.
Code Line 31: After the loops have finished, this line uses the total_temp and count variables to calculate the average temperature for the month and stores it in the variable monthly_avg_temp.
Code Lines 32 and 33: Lastly, these lines print a formatted statement that reports the average temperature for the month, followed by a closing divider line.
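The code above does not yet answer two of the questions posed earlier: which date had the highest temperature of the month, and which had the lowest. One possible approach, sketched here, uses Python's built-in max() and min() functions with a key function that selects the column to compare. The variable names hottest and coldest are illustrative.

```python
data = [
    ["2023-11-01", 58, 32], ["2023-11-02", 64, 35], ["2023-11-03", 67, 44],
    ["2023-11-04", 67, 41], ["2023-11-05", 65, 45], ["2023-11-06", 71, 48],
    ["2023-11-07", 52, 38], ["2023-11-08", 49, 34], ["2023-11-09", 49, 31],
    ["2023-11-10", 53, 31], ["2023-11-11", 54, 31], ["2023-11-12", 63, 33],
    ["2023-11-13", 66, 42], ["2023-11-14", 66, 38], ["2023-11-15", 68, 44],
    ["2023-11-16", 59, 41], ["2023-11-17", 58, 36], ["2023-11-18", 52, 39],
    ["2023-11-19", 49, 35], ["2023-11-20", 46, 34], ["2023-11-21", 48, 30],
    ["2023-11-22", 49, 30], ["2023-11-23", 42, 32], ["2023-11-24", 36, 32],
    ["2023-11-25", 38, 29], ["2023-11-26", 36, 29], ["2023-11-27", 40, 25],
    ["2023-11-28", 41, 24], ["2023-11-29", 36, 25], ["2023-11-30", 34, 30]
]

# max() with a key compares records by their high temperature (index 1)
hottest = max(data, key=lambda record: record[1])
# min() with a key compares records by their low temperature (index 2)
coldest = min(data, key=lambda record: record[2])

print("Highest temperature:", hottest[1], "on", hottest[0])  # 71 on 2023-11-06
print("Lowest temperature:", coldest[2], "on", coldest[0])   # 24 on 2023-11-28
```

If several days shared the same extreme value, max() and min() would return the first matching record in the list.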

Improving the Output of the Processed Dataset

The output above shows only the original high and low values; the daily averages stored in the processed_data list never appear. Simply passing processed_data to print() would dump the raw list of lists, which is unsuitable for users to read. How might we improve the output?

One approach would be to iterate through the processed_data list and print one line for each day that displays the date and the average temperature, followed by a row of text characters that visually represents that average. Study the following code, which repeats the processing code above but replaces the table-printing lines with Code Lines 27 through 30 in the new version below. Those lines loop through processed_data, unpack each record (the underscore names mark the high and low values, which are not needed here), and build a bar from one '\u2584' block character per whole degree. Because int(avg_temp) drops any fraction, a day averaging 42.5 gets the same 42-character bar as a day averaging 42.0.

data = [
    ["2023-11-01", 58, 32], ["2023-11-02", 64, 35], ["2023-11-03", 67, 44],
    ["2023-11-04", 67, 41], ["2023-11-05", 65, 45], ["2023-11-06", 71, 48],
    ["2023-11-07", 52, 38], ["2023-11-08", 49, 34], ["2023-11-09", 49, 31],
    ["2023-11-10", 53, 31], ["2023-11-11", 54, 31], ["2023-11-12", 63, 33],
    ["2023-11-13", 66, 42], ["2023-11-14", 66, 38], ["2023-11-15", 68, 44],
    ["2023-11-16", 59, 41], ["2023-11-17", 58, 36], ["2023-11-18", 52, 39],
    ["2023-11-19", 49, 35], ["2023-11-20", 46, 34], ["2023-11-21", 48, 30],
    ["2023-11-22", 49, 30], ["2023-11-23", 42, 32], ["2023-11-24", 36, 32],
    ["2023-11-25", 38, 29], ["2023-11-26", 36, 29], ["2023-11-27", 40, 25],
    ["2023-11-28", 41, 24], ["2023-11-29", 36, 25], ["2023-11-30", 34, 30]
]
processed_data = []
total_temp = 0
count = 0

for record in data:
    date, hi, lo = record
    avg_temp = (hi + lo) / 2
    processed_data.append([date, hi, lo, avg_temp])
    total_temp += avg_temp
    count += 1
monthly_avg_temp = total_temp / count
print("-" * 90)
print("Salt Lake City - November 2023 - Temperature Chart")
print("-" * 90)
for record in processed_data:
    date, _, _, avg_temp = record  # Unpack each list
    hash_graph = '\u2584' * int(avg_temp)
    print(f"{date}: {avg_temp}°F - {hash_graph}")
print("-" * 90)
print("Average Temperature during November 2023: ", "{:.2f}".format(monthly_avg_temp))
print("-" * 90)

Output

------------------------------------------------------------------------------------------
Salt Lake City - November 2023 - Temperature Chart
------------------------------------------------------------------------------------------
2023-11-01: 45.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-02: 49.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-03: 55.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-04: 54.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-05: 55.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-06: 59.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-07: 45.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-08: 41.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-09: 40.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-10: 42.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-11: 42.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-12: 48.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-13: 54.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-14: 52.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-15: 56.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-16: 50.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-17: 47.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-18: 45.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-19: 42.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-20: 40.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-21: 39.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-22: 39.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-23: 37.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-24: 34.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-25: 33.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-26: 32.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-27: 32.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-28: 32.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-29: 30.5°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
2023-11-30: 32.0°F - ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
------------------------------------------------------------------------------------------
Average Temperature during November 2023:  43.57
------------------------------------------------------------------------------------------
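One character per degree works here because the November averages fit on a single line, but larger values would overflow the screen. A possible variation, sketched below with a hypothetical helper named text_bar, scales each bar to a fixed maximum width. The helper, its parameters, and the sample list are illustrative assumptions; the three sample averages are taken from the chart above.

```python
def text_bar(value, max_value, width=40, char="\u2584"):
    """Return a bar whose length is proportional to value / max_value."""
    if max_value <= 0:
        return ""
    length = round(value / max_value * width)
    # Clamp the length so the bar is never negative or wider than width
    return char * max(0, min(width, length))

# A few daily averages taken from the chart above
samples = [("2023-11-06", 59.5), ("2023-11-16", 50.0), ("2023-11-29", 30.5)]
largest = max(avg for _, avg in samples)

for date, avg in samples:
    print(f"{date}: {avg:5.1f}°F - {text_bar(avg, largest)}")
```

With this scaling, the largest value always fills the full width and every other bar is drawn relative to it.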

Note: While the above textual graph is an improvement over simply printing the Python list of lists, it is more common that data visualizations are produced in a more graphical (colors, shapes, images, etc.) manner. We will use this example dataset above to create a more visual version of similar data when we look at Data Visualization later in the book.

Publicly Available Datasets

Many organizations and websites make datasets available to the public. Here is a short list of a few of them to get you started exploring different data types available in datasets.

NASA Earth Observation Data ↗
Google Dataset Search ↗
Kaggle ↗
Data.Gov ↗
World Health Organization (WHO) ↗
BFI Film Industry Statistics ↗
New York City Taxi Trip Record Data ↗
FBI Crime Data Explorer ↗
... and many more, like those found ... here ↗ ... or here ↗ ... or here ↗.

Public datasets are excellent resources for learning, research, and analysis. Most of the datasets available at the above locations are free for educational and non-commercial use, though it is worth checking each dataset's license or terms before relying on it. Some of these public datasets are small; others are huge. I recommend that you visit some of these sites and explore the types of data available to the public.
