Programming Across Disciplines

Datasets

Overview
Example Dataset
Publicly Available Datasets

Overview

In programming, datasets play an essential role and form the basis of data analysis, machine learning, and many other applications. A dataset is a collection of data, typically organized in a structured format such as one or more tables, where rows represent individual records and columns represent attributes or features of those records. For instance, a dataset might comprise a list of customers, with each row detailing a customer's name, age, and purchase history, or it could be a complex set of meteorological data charting temperatures and weather conditions across different geographies and times.
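As a minimal sketch of the rows-and-columns idea, here is one way a tiny customer table might look in Python. The customer names and values are hypothetical and invented only for illustration, and a list of dictionaries is just one of several reasonable representations.

```python
# Hypothetical customer records, invented purely for illustration.
# Each dictionary is one record (row); each key is an attribute (column).
customers = [
    {"name": "Ana", "age": 34, "purchases": 5},
    {"name": "Ben", "age": 27, "purchases": 2},
    {"name": "Chloe", "age": 45, "purchases": 9},
]

# Read one record (row)
first_customer = customers[0]

# Read one attribute (column) across every record
ages = [customer["age"] for customer in customers]

print(first_customer["name"])
print(ages)
```

Selecting by index gives a whole record, while the list comprehension pulls a single attribute out of every record.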

Concept: Dataset

In programming, a dataset is a collection of data that is typically organized and structured in some manner. It serves as the raw material from which information can be extracted, processed, analyzed, and interpreted. Datasets can come in various forms and sizes, ranging from simple arrays of numbers to complex, multi-dimensional structures containing diverse types of data. They can be sourced from various places like databases, file systems, APIs, or generated through simulations and experiments. In the context of data analysis, machine learning, or scientific research, a dataset is crucial as it forms the basis upon which algorithms operate and insights are drawn. The quality, relevance, and integrity of a dataset greatly influence the outcomes of any computational process applied to it. Good dataset management practices involve cleaning, organizing, and sometimes transforming data to make it suitable for specific tasks. The representation and handling of datasets are often facilitated by programming libraries and tools tailored to the specific needs of the data and the objectives of the project.

Datasets can range from small, simple collections used for basic tasks to vast, complex aggregations of information employed in advanced computational analyses. The significance of datasets extends beyond mere data storage; they are invaluable for training algorithms, uncovering insights through data analysis, and driving decision-making processes in various fields.

Many datasets are publicly available, offering a rich resource for programmers and data scientists. These are found on platforms like Kaggle, which hosts datasets ranging from historical election results to the latest trends in video gaming, or the UCI Machine Learning Repository, offering datasets for experimental purposes in machine learning, such as the famous Iris dataset or the Wine Quality dataset. Government websites also provide access to public data, like demographic information and economic indicators.

With its robust libraries and tools, Python excels at importing, processing, and analyzing datasets, allowing you to glean meaningful information from them. This section will guide you through the fundamentals of working with datasets in Python, illustrating how to access, manipulate, and extract value from them.

Example Dataset

Let's look at an example dataset and walk through how we might process it. This dataset contains the high and low temperatures for each day of November 2023 in Salt Lake City, Utah. Notice that there is a header row that labels each attribute (column). Also, there is one record (row) for each day of the month. Each record contains the date, the high temperature, and the low temperature for that day.

Date,Hi,Lo
2023-11-01,58,32
2023-11-02,64,35
2023-11-03,67,44
2023-11-04,67,41
2023-11-05,65,45
2023-11-06,71,48
2023-11-07,52,38
2023-11-08,49,34
2023-11-09,49,31
2023-11-10,53,31
2023-11-11,54,31
2023-11-12,63,33
2023-11-13,66,42
2023-11-14,66,38
2023-11-15,68,44
2023-11-16,59,41
2023-11-17,58,36
2023-11-18,52,39
2023-11-19,49,35
2023-11-20,46,34
2023-11-21,48,30
2023-11-22,49,30
2023-11-23,42,32
2023-11-24,36,32
2023-11-25,38,29
2023-11-26,36,29
2023-11-27,40,25
2023-11-28,41,24
2023-11-29,36,25
2023-11-30,34,30

Code Details

Code Line 1: This is the header of the data. It contains the names of the columns (like you would see in a spreadsheet or CSV file). Most datasets include this first row to help you identify what each column contains.
Code Lines 2 thru 31: These are the records in this dataset. Note there are 30 records, and each contains a date (Column 1), a high temperature (Column 2), and a low temperature (Column 3).
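The role of the header row can be sketched with Python's built-in csv module. This example is an illustration rather than part of the walkthrough: it uses only the header and the first three records of the dataset, held in a string named csv_text (a name chosen here for the sketch).

```python
import csv
import io

# The header row and the first three records of the dataset above
csv_text = """Date,Hi,Lo
2023-11-01,58,32
2023-11-02,64,35
2023-11-03,67,44
"""

# DictReader treats the first row as the column names for every record
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

print(reader.fieldnames)   # the column names taken from the header row
print(len(records))        # the number of records (the header is not counted)
print(records[0]["Hi"])    # values are read as strings until we convert them
```

Note that the csv module reads every value as text, so the temperatures need to be converted with int() before doing any arithmetic.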

What kinds of information, summaries, or aggregations could we create using this dataset?

Calculate the average temperature for each day.
Determine the date that had the highest temperature of the month.
Determine the date that had the lowest temperature of the month.
Calculate the average temperature of the entire month.
Create a graph that visually depicts the temperature of all of the days of the month.

Let's walk through a solution that handles all five of these tasks. First, let's set up the data in a Python data structure. We have several options, but for this first example I will format the data and create a list of tuples directly in my code file. In a production system, we would never include our data directly in our code file; the data would commonly be stored in a file or a database table. For demonstration, though, it's easier to see everything (data and code) together, so I'll keep it all in my code file (I'll show a CSV file-based example for comparison below).

Do you remember how to create a list of tuples? Here's the general form we need:

my_list = [(tuple_1), (tuple_2), (tuple_3), ... (tuple_n)]

In the case of our dataset above, we would form a tuple of each record (row). So, for example, we would reformat the first record of the dataset:

2023-11-01,58,32

as the following tuple:

("2023-11-01", 58, 32)

If we then format all 30 of the records in the dataset as tuples, separate each tuple with a comma, and place the entire set of tuples inside the square brackets of a list, it would look like this:

# Dataset formatted as a list of tuples
data = [
    ("2023-11-01", 58, 32), ("2023-11-02", 64, 35), ("2023-11-03", 67, 44),
    ("2023-11-04", 67, 41), ("2023-11-05", 65, 45), ("2023-11-06", 71, 48),
    ("2023-11-07", 52, 38), ("2023-11-08", 49, 34), ("2023-11-09", 49, 31),
    ("2023-11-10", 53, 31), ("2023-11-11", 54, 31), ("2023-11-12", 63, 33),
    ("2023-11-13", 66, 42), ("2023-11-14", 66, 38), ("2023-11-15", 68, 44),
    ("2023-11-16", 59, 41), ("2023-11-17", 58, 36), ("2023-11-18", 52, 39),
    ("2023-11-19", 49, 35), ("2023-11-20", 46, 34), ("2023-11-21", 48, 30),
    ("2023-11-22", 49, 30), ("2023-11-23", 42, 32), ("2023-11-24", 36, 32),
    ("2023-11-25", 38, 29), ("2023-11-26", 36, 29), ("2023-11-27", 40, 25),
    ("2023-11-28", 41, 24), ("2023-11-29", 36, 25), ("2023-11-30", 34, 30)
]

A list of tuples like this will make it reasonably easy to loop through the dataset to process each record.
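For comparison, here is a hedged sketch of how the same list of tuples might be built from a CSV file instead of being typed into the code. The function name load_temperatures and the temporary sample file are assumptions made for this illustration; the sample file holds only the first three records so the sketch runs on its own.

```python
import csv
import os
import tempfile


def load_temperatures(path):
    """Read a Date,Hi,Lo CSV file and return a list of (date, hi, lo) tuples."""
    records = []
    with open(path, newline="") as csv_file:
        reader = csv.reader(csv_file)
        next(reader, None)  # skip the header row
        for date, hi, lo in reader:
            # Convert the temperature text to integers
            records.append((date, int(hi), int(lo)))
    return records


# Write the header and first three records to a temporary file
sample_text = "Date,Hi,Lo\n2023-11-01,58,32\n2023-11-02,64,35\n2023-11-03,67,44\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

sample_data = load_temperatures(sample_path)
os.remove(sample_path)  # clean up the temporary file

print(sample_data)
```

The result has the same shape as the hand-written list, so any code that loops through the list of tuples would work unchanged with data loaded this way.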

Next, I'll write the code to create a few variables to use during processing, loop through the list of tuples to calculate the daily average temperatures, append each day to a processed queue (list), and also keep a running total temperature and record count that I'll use at the end to calculate the average temperature for the month. After the calculation loop, I'll print the results.

# Dataset formatted as a list of tuples
data = [
    ("2023-11-01", 58, 32), ("2023-11-02", 64, 35), ("2023-11-03", 67, 44),
    ("2023-11-04", 67, 41), ("2023-11-05", 65, 45), ("2023-11-06", 71, 48),
    ("2023-11-07", 52, 38), ("2023-11-08", 49, 34), ("2023-11-09", 49, 31),
    ("2023-11-10", 53, 31), ("2023-11-11", 54, 31), ("2023-11-12", 63, 33),
    ("2023-11-13", 66, 42), ("2023-11-14", 66, 38), ("2023-11-15", 68, 44),
    ("2023-11-16", 59, 41), ("2023-11-17", 58, 36), ("2023-11-18", 52, 39),
    ("2023-11-19", 49, 35), ("2023-11-20", 46, 34), ("2023-11-21", 48, 30),
    ("2023-11-22", 49, 30), ("2023-11-23", 42, 32), ("2023-11-24", 36, 32),
    ("2023-11-25", 38, 29), ("2023-11-26", 36, 29), ("2023-11-27", 40, 25),
    ("2023-11-28", 41, 24), ("2023-11-29", 36, 25), ("2023-11-30", 34, 30)
]

# Declare variables for use in processing
processed_data = []
total_temp = 0
count = 0

# Iterate through the dataset to calculate values
for date, hi, lo in data:
    avg_temp = (hi + lo) / 2
    processed_data.append((date, hi, lo, avg_temp))
    total_temp += avg_temp
    count += 1

# Calculate the average temperature for the entire month
monthly_avg_temp = total_temp / count

# Output results
print(processed_data)
print("Average Temperature during November 2023: ", "{:.2f}".format(monthly_avg_temp))

Output

[('2023-11-01', 58, 32, 45.0), ('2023-11-02', 64, 35, 49.5), ('2023-11-03', 67, 44, 55.5), ('2023-11-04', 67, 41, 54.0), ('2023-11-05', 65, 45, 55.0), ('2023-11-06', 71, 48, 59.5), ('2023-11-07', 52, 38, 45.0), ('2023-11-08', 49, 34, 41.5), ('2023-11-09', 49, 31, 40.0), ('2023-11-10', 53, 31, 42.0), ('2023-11-11', 54, 31, 42.5), ('2023-11-12', 63, 33, 48.0), ('2023-11-13', 66, 42, 54.0), ('2023-11-14', 66, 38, 52.0), ('2023-11-15', 68, 44, 56.0), ('2023-11-16', 59, 41, 50.0), ('2023-11-17', 58, 36, 47.0), ('2023-11-18', 52, 39, 45.5), ('2023-11-19', 49, 35, 42.0), ('2023-11-20', 46, 34, 40.0), ('2023-11-21', 48, 30, 39.0), ('2023-11-22', 49, 30, 39.5), ('2023-11-23', 42, 32, 37.0), ('2023-11-24', 36, 32, 34.0), ('2023-11-25', 38, 29, 33.5), ('2023-11-26', 36, 29, 32.5), ('2023-11-27', 40, 25, 32.5), ('2023-11-28', 41, 24, 32.5), ('2023-11-29', 36, 25, 30.5), ('2023-11-30', 34, 30, 32.0)]
Average Temperature during November 2023:  43.57

Code & Output Details

Code Lines 2 thru 13: This is the list of tuples in our dataset.
Code Line 16: This is a new list variable we'll use to store our processed list, containing our original data values and the calculated daily average temperature.
Code Line 17: This variable will store the running total temperature that we'll use after looping through our dataset to calculate the average temperature for the month.
Code Line 18: This counter variable will count the number of records in our dataset. Obviously, we know how many there are in this particular dataset, but we often do not know ahead of time, so setting this up as variable-based is more conducive to handling any number of records.
Code Line 21: Next, we use a for loop structure to iterate through the dataset to calculate the daily average temperature and load our processed_data list with those results. In this loop, we'll also accumulate our total temperature and increment our counter so we can calculate our average monthly temperature after the loop as well.
- Note: Notice the use of the multi-variable for loop signature. If you need a refresher on this, revisit the for Loop ↗ page.
Code Line 22: This first line inside the for loop calculates the average temperature for each record (day). Notice that the calculation (hi + lo) / 2 is assigned to the variable avg_temp.
Code Line 23: This line adds (appends) each record to the processed_data list, effectively copying the original dataset list values date, hi, and lo and adding the calculated avg_temp value.
Code Line 24: This line accumulates the temperature of each day as the loop iterates. We'll use this after the loop to calculate the average temperature for the month.
Code Line 25: This line counts the number of records in the dataset. We'll use this, along with the total_temp value, after the loop to calculate the average temperature for the month.
Code Line 28: After the loop has finished, this line uses the total_temp and count variables to calculate the average temperature for the month and stores it in the variable monthly_avg_temp.
Code Line 31: This line prints the processed_data list. Note this is not formatted very well. We'll take a look at improving this below.
Code Line 32: And lastly, this line prints a formatted statement that reports the average temperature for the month.
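The code above covers the daily and monthly averages. The two tasks from the earlier list that ask for the dates with the highest and lowest temperatures could be handled in several ways; one possible sketch uses Python's built-in max() and min() functions with a key. The variable names hottest and coldest are chosen here for illustration.

```python
# Dataset formatted as a list of tuples
data = [
    ("2023-11-01", 58, 32), ("2023-11-02", 64, 35), ("2023-11-03", 67, 44),
    ("2023-11-04", 67, 41), ("2023-11-05", 65, 45), ("2023-11-06", 71, 48),
    ("2023-11-07", 52, 38), ("2023-11-08", 49, 34), ("2023-11-09", 49, 31),
    ("2023-11-10", 53, 31), ("2023-11-11", 54, 31), ("2023-11-12", 63, 33),
    ("2023-11-13", 66, 42), ("2023-11-14", 66, 38), ("2023-11-15", 68, 44),
    ("2023-11-16", 59, 41), ("2023-11-17", 58, 36), ("2023-11-18", 52, 39),
    ("2023-11-19", 49, 35), ("2023-11-20", 46, 34), ("2023-11-21", 48, 30),
    ("2023-11-22", 49, 30), ("2023-11-23", 42, 32), ("2023-11-24", 36, 32),
    ("2023-11-25", 38, 29), ("2023-11-26", 36, 29), ("2023-11-27", 40, 25),
    ("2023-11-28", 41, 24), ("2023-11-29", 36, 25), ("2023-11-30", 34, 30)
]

# The key function tells max() and min() which field of each record to compare:
# index 1 is the high temperature and index 2 is the low temperature
hottest = max(data, key=lambda record: record[1])
coldest = min(data, key=lambda record: record[2])

print(f"Highest temperature: {hottest[1]} on {hottest[0]}")
print(f"Lowest temperature: {coldest[2]} on {coldest[0]}")
```

If two days tie for the highest or lowest value, max() and min() return the first matching record in the list.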

Improving the Output of the Processed Dataset

As indicated in the Code Details above, printing the processed_data list with no formatting produces output unsuitable for users to read. Since it is a list of tuples, how might we improve the output?

One approach would be to iterate through the processed_data list and print one line for each day that displays the date and the average temperature for that date, followed by a run of text characters that visually represents the average temperature. Study the following code, which is a copy of the completed code above, with Code Line 31 (above) replaced by Code Lines 32 through 35 in the new version of the code below. See the Code Details below for a description of this change.

# Dataset formatted as a list of tuples
data = [
    ("2023-11-01", 58, 32), ("2023-11-02", 64, 35), ("2023-11-03", 67, 44),
    ("2023-11-04", 67, 41), ("2023-11-05", 65, 45), ("2023-11-06", 71, 48),
    ("2023-11-07", 52, 38), ("2023-11-08", 49, 34), ("2023-11-09", 49, 31),
    ("2023-11-10", 53, 31), ("2023-11-11", 54, 31), ("2023-11-12", 63, 33),
    ("2023-11-13", 66, 42), ("2023-11-14", 66, 38), ("2023-11-15", 68, 44),
    ("2023-11-16", 59, 41), ("2023-11-17", 58, 36), ("2023-11-18", 52, 39),
    ("2023-11-19", 49, 35), ("2023-11-20", 46, 34), ("2023-11-21", 48, 30),
    ("2023-11-22", 49, 30), ("2023-11-23", 42, 32), ("2023-11-24", 36, 32),
    ("2023-11-25", 38, 29), ("2023-11-26", 36, 29), ("2023-11-27", 40, 25),
    ("2023-11-28", 41, 24), ("2023-11-29", 36, 25), ("2023-11-30", 34, 30)
]

# Declare variables for use in processing
processed_data = []
total_temp = 0
count = 0

# Iterate through the dataset to calculate values
for date, hi, lo in data:
    avg_temp = (hi + lo) / 2
    processed_data.append((date, hi, lo, avg_temp))
    total_temp += avg_temp
    count += 1

# Calculate the average temperature for the entire month
monthly_avg_temp = total_temp / count

# Output results

# Print each day with its average temperature and a textual graph representing the temperature
for date, _, _, avg_temp in processed_data:
    hash_graph = '#' * int(avg_temp)
    print(f"{date}: {avg_temp}°F - {hash_graph}")

# Print the average temperature for the month
print("Average Temperature during November 2023: ", "{:.2f}".format(monthly_avg_temp))

Output

2023-11-01: 45.0°F - #############################################
2023-11-02: 49.5°F - #################################################
2023-11-03: 55.5°F - #######################################################
2023-11-04: 54.0°F - ######################################################
2023-11-05: 55.0°F - #######################################################
2023-11-06: 59.5°F - ###########################################################
2023-11-07: 45.0°F - #############################################
2023-11-08: 41.5°F - #########################################
2023-11-09: 40.0°F - ########################################
2023-11-10: 42.0°F - ##########################################
2023-11-11: 42.5°F - ##########################################
2023-11-12: 48.0°F - ################################################
2023-11-13: 54.0°F - ######################################################
2023-11-14: 52.0°F - ####################################################
2023-11-15: 56.0°F - ########################################################
2023-11-16: 50.0°F - ##################################################
2023-11-17: 47.0°F - ###############################################
2023-11-18: 45.5°F - #############################################
2023-11-19: 42.0°F - ##########################################
2023-11-20: 40.0°F - ########################################
2023-11-21: 39.0°F - #######################################
2023-11-22: 39.5°F - #######################################
2023-11-23: 37.0°F - #####################################
2023-11-24: 34.0°F - ##################################
2023-11-25: 33.5°F - #################################
2023-11-26: 32.5°F - ################################
2023-11-27: 32.5°F - ################################
2023-11-28: 32.5°F - ################################
2023-11-29: 30.5°F - ##############################
2023-11-30: 32.0°F - ################################

Average Temperature during November 2023:  43.57

Code & Output Details

Code Line 33: On this line, we use a for loop structure again to iterate through the processed_data dataset to produce the textual graph of the processed data. Notice the use of the placeholders ( _, _, ) in the for loop signature. If you need a refresher on this, revisit the for Loop ↗ page.
Code Line 34: This line produces the set of characters equivalent to the avg_temp for each day as the loop iterates.
Code Line 35: This line prints the date, average temperature, and the graph for each day as the loop iterates.
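One detail worth noticing in the graph is that int() drops the fractional part, so days averaging 45.0 and 45.5 get bars of the same length. The sketch below wraps the bar logic in a small helper to show this. The function name text_bar and its degrees_per_symbol parameter are hypothetical additions for illustration, not part of the code above.

```python
def text_bar(value, symbol="#", degrees_per_symbol=1):
    """Return a text bar; any fraction of a symbol is dropped by int()."""
    return symbol * int(value / degrees_per_symbol)


# int() truncates, so 45.0 and 45.5 produce bars of the same length
print(len(text_bar(45.0)))
print(len(text_bar(45.5)))

# A larger degrees_per_symbol value produces shorter bars for narrow screens
print(text_bar(59.5, degrees_per_symbol=2))
```

Scaling the bar this way keeps the relative lengths roughly the same while using fewer characters per line.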

Note: While the above textual graph is an improvement over simply printing the Python list of tuples, data visualizations are more commonly produced in a graphical manner (colors, shapes, images, etc.). We will use this example dataset to create a more visual version of this data when we look at Data Visualization ↗ later in this chapter.

Publicly Available Datasets

Many organizations and websites make datasets available to the public. Here is a short list to get you started exploring the different types of data available in datasets.

NASA Earth Observation Data
Google Dataset Search
Kaggle
Data.Gov
World Health Organization (WHO)
BFI Film Industry Statistics
New York City Taxi Trip Record Data
FBI Crime Data Explorer
... and many more.

Public datasets are excellent resources for learning, research, and analysis. Most of the datasets available at the above locations are free to use for educational and non-commercial use. Some of these public datasets are small, others are huge. I recommend that you visit some of these sites and explore the types of data available to the public.

« Previous : Files : Phone Case Store

Next : Data (Input) : Databases »
