Python Across Disciplines
by John Gordon © 2023


Datasets


Overview

In programming, datasets play an essential role and form the basis of data analysis, machine learning, and many other applications. A dataset is a collection of data, typically organized in a structured format such as one or more tables, where rows represent individual records and columns represent attributes or features of those records. For instance, a dataset might comprise a list of customers, with each row detailing a customer's name, age, and purchase history, or it could be a complex set of meteorological data charting temperatures and weather conditions across different geographies and times.
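
As a minimal sketch of the row-and-column idea, a small customer table could be held in Python as a list of dictionaries, one dictionary per record. The field names and values below are invented purely for illustration and do not come from any real dataset.

```python
# Hypothetical customer dataset: each dictionary is one record (row),
# and each key is an attribute (column). All values are made up.
customers = [
    {"name": "Ana", "age": 34, "purchases": 5},
    {"name": "Ben", "age": 27, "purchases": 2},
    {"name": "Cam", "age": 41, "purchases": 9},
]

# Read one attribute (column) across every record (row)
ages = [customer["age"] for customer in customers]
print(ages)
print(len(customers), "records,", len(customers[0]), "attributes")
```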

Concept: Dataset

In programming, a dataset is a collection of data that is typically organized and structured in some manner. It serves as the raw material from which information can be extracted, processed, analyzed, and interpreted. Datasets can come in various forms and sizes, ranging from simple arrays of numbers to complex, multi-dimensional structures containing diverse types of data. They can be sourced from various places like databases, file systems, APIs, or generated through simulations and experiments. In the context of data analysis, machine learning, or scientific research, a dataset is crucial as it forms the basis upon which algorithms operate and insights are drawn. The quality, relevance, and integrity of a dataset greatly influence the outcomes of any computational process applied to it. Good dataset management practices involve cleaning, organizing, and sometimes transforming data to make it suitable for specific tasks. The representation and handling of datasets are often facilitated by programming libraries and tools tailored to the specific needs of the data and the objectives of the project.



Datasets can range from small, simple collections used for basic tasks to vast, complex aggregations of information employed in advanced computational analyses. The significance of datasets extends beyond mere data storage; they are invaluable for training algorithms, uncovering insights through data analysis, and driving decision-making processes in various fields.

Many datasets are publicly available, offering a rich resource for programmers and data scientists. These are found on platforms like Kaggle, which hosts datasets ranging from historical election results to the latest trends in video gaming, or the UCI Machine Learning Repository, offering datasets for experimental purposes in machine learning, such as the famous Iris dataset or the Wine Quality dataset. Government websites also provide access to public data, like demographic information and economic indicators.

With its robust libraries and tools, Python excels at importing, processing, and analyzing datasets, allowing you to glean meaningful information from them. This section will guide you through the fundamentals of working with datasets in Python, illustrating how to access, manipulate, and extract value from them.

Example Dataset

Let's look at an example dataset and walk through how we might process it. This dataset contains the high and low temperatures for each day of November 2023 in Salt Lake City, Utah. Notice that there is a header row that labels each attribute (column). Also, there is one record (row) for each day of the month. Each record contains the date, the high temperature, and the low temperature for that day.

Date,Hi,Lo
2023-11-01,58,32
2023-11-02,64,35
2023-11-03,67,44
2023-11-04,67,41
2023-11-05,65,45
2023-11-06,71,48
2023-11-07,52,38
2023-11-08,49,34
2023-11-09,49,31
2023-11-10,53,31
2023-11-11,54,31
2023-11-12,63,33
2023-11-13,66,42
2023-11-14,66,38
2023-11-15,68,44
2023-11-16,59,41
2023-11-17,58,36
2023-11-18,52,39
2023-11-19,49,35
2023-11-20,46,34
2023-11-21,48,30
2023-11-22,49,30
2023-11-23,42,32
2023-11-24,36,32
2023-11-25,38,29
2023-11-26,36,29
2023-11-27,40,25
2023-11-28,41,24
2023-11-29,36,25
2023-11-30,34,30


What kinds of information, summaries, or aggregations could we create using this dataset?

  1. Calculate the average temperature for each day.
  2. Determine the date that had the highest temperature of the month.
  3. Determine the date that had the lowest temperature of the month.
  4. Calculate the average temperature of the entire month.
  5. Create a graph that visually depicts the temperature of all of the days of the month.

Let's walk through a solution that addresses these tasks. First, let's set up the data in a Python data structure. We could use several options, but for this first example I will format the data and create a list of tuples directly in my code file. In a production system, we would not normally embed data in the code file; it would more commonly be stored in a file or a database table. For demonstration, though, it's easier to see everything (data and code) together, so I'll keep it all in my code file (I'll show a CSV file-based example for comparison below).

Do you remember how to create a list of tuples? Here's the general form we need:

my_list = [(tuple_1), (tuple_2), (tuple_3), ..., (tuple_n)]
In the case of our dataset above, we would form a tuple of each record (row). So, for example, we would reformat the first record of the dataset:

2023-11-01,58,32
as

("2023-11-01", 58, 32)
And then if we format all 30 of the records in the dataset as tuples, separate the tuples with commas, and place the entire set of tuples inside the square brackets of a list, it would look like this:

# Dataset formatted as a list of tuples
data = [
    ("2023-11-01", 58, 32), ("2023-11-02", 64, 35), ("2023-11-03", 67, 44),
    ("2023-11-04", 67, 41), ("2023-11-05", 65, 45), ("2023-11-06", 71, 48),
    ("2023-11-07", 52, 38), ("2023-11-08", 49, 34), ("2023-11-09", 49, 31),
    ("2023-11-10", 53, 31), ("2023-11-11", 54, 31), ("2023-11-12", 63, 33),
    ("2023-11-13", 66, 42), ("2023-11-14", 66, 38), ("2023-11-15", 68, 44),
    ("2023-11-16", 59, 41), ("2023-11-17", 58, 36), ("2023-11-18", 52, 39),
    ("2023-11-19", 49, 35), ("2023-11-20", 46, 34), ("2023-11-21", 48, 30),
    ("2023-11-22", 49, 30), ("2023-11-23", 42, 32), ("2023-11-24", 36, 32),
    ("2023-11-25", 38, 29), ("2023-11-26", 36, 29), ("2023-11-27", 40, 25),
    ("2023-11-28", 41, 24), ("2023-11-29", 36, 25), ("2023-11-30", 34, 30)
]

A list of tuples like this will make it reasonably easy to loop through the dataset to process each record.
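
Typing the tuples by hand is fine for 30 records, but the same list could also be built from the comma-separated text itself. The following is only a rough sketch using the standard library's csv module; it holds the first few lines of the dataset in a string so that it runs on its own, and the filename mentioned in the comment is hypothetical.

```python
import csv
import io

# The first few lines of the dataset, held in a string so the sketch is
# self-contained. With a real file you might instead use something like
# open("slc_temps.csv", newline="") -- that filename is hypothetical.
csv_text = """Date,Hi,Lo
2023-11-01,58,32
2023-11-02,64,35
2023-11-03,67,44
"""

data = []
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row (Date,Hi,Lo)
for date, hi, lo in reader:
    # csv.reader yields strings, so convert the temperatures to integers
    data.append((date, int(hi), int(lo)))

print(data)
```

Converting the temperatures with int() matters here: left as strings, they could not be added or averaged later.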

Next, I'll write the code to create a few variables to use during processing, loop through the list of tuples to calculate the daily average temperatures, append each day to a processed queue (list), and also keep a running total temperature and record count that I'll use at the end to calculate the average temperature for the month. After the calculation loop, I'll print the results.

# Dataset formatted as a list of tuples
data = [
    ("2023-11-01", 58, 32), ("2023-11-02", 64, 35), ("2023-11-03", 67, 44),
    ("2023-11-04", 67, 41), ("2023-11-05", 65, 45), ("2023-11-06", 71, 48),
    ("2023-11-07", 52, 38), ("2023-11-08", 49, 34), ("2023-11-09", 49, 31),
    ("2023-11-10", 53, 31), ("2023-11-11", 54, 31), ("2023-11-12", 63, 33),
    ("2023-11-13", 66, 42), ("2023-11-14", 66, 38), ("2023-11-15", 68, 44),
    ("2023-11-16", 59, 41), ("2023-11-17", 58, 36), ("2023-11-18", 52, 39),
    ("2023-11-19", 49, 35), ("2023-11-20", 46, 34), ("2023-11-21", 48, 30),
    ("2023-11-22", 49, 30), ("2023-11-23", 42, 32), ("2023-11-24", 36, 32),
    ("2023-11-25", 38, 29), ("2023-11-26", 36, 29), ("2023-11-27", 40, 25),
    ("2023-11-28", 41, 24), ("2023-11-29", 36, 25), ("2023-11-30", 34, 30)
]

# Declare variables for use in processing
processed_data = []
total_temp = 0
count = 0

# Iterate through the dataset to calculate values
for date, hi, lo in data:
    avg_temp = (hi + lo) / 2
    processed_data.append((date, hi, lo, avg_temp))
    total_temp += avg_temp
    count += 1

# Calculate the average temperature for the entire month
monthly_avg_temp = total_temp / count

# Output results
print(processed_data)
print("Average Temperature during November 2023: ", "{:.2f}".format(monthly_avg_temp))

Output

[('2023-11-01', 58, 32, 45.0), ('2023-11-02', 64, 35, 49.5), ('2023-11-03', 67, 44, 55.5), ('2023-11-04', 67, 41, 54.0), ('2023-11-05', 65, 45, 55.0), ('2023-11-06', 71, 48, 59.5), ('2023-11-07', 52, 38, 45.0), ('2023-11-08', 49, 34, 41.5), ('2023-11-09', 49, 31, 40.0), ('2023-11-10', 53, 31, 42.0), ('2023-11-11', 54, 31, 42.5), ('2023-11-12', 63, 33, 48.0), ('2023-11-13', 66, 42, 54.0), ('2023-11-14', 66, 38, 52.0), ('2023-11-15', 68, 44, 56.0), ('2023-11-16', 59, 41, 50.0), ('2023-11-17', 58, 36, 47.0), ('2023-11-18', 52, 39, 45.5), ('2023-11-19', 49, 35, 42.0), ('2023-11-20', 46, 34, 40.0), ('2023-11-21', 48, 30, 39.0), ('2023-11-22', 49, 30, 39.5), ('2023-11-23', 42, 32, 37.0), ('2023-11-24', 36, 32, 34.0), ('2023-11-25', 38, 29, 33.5), ('2023-11-26', 36, 29, 32.5), ('2023-11-27', 40, 25, 32.5), ('2023-11-28', 41, 24, 32.5), ('2023-11-29', 36, 25, 30.5), ('2023-11-30', 34, 30, 32.0)]
Average Temperature during November 2023:  43.57
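
The listing above covers the daily and monthly averages. Tasks 2 and 3 from the earlier list (the dates with the highest and lowest temperatures) could be handled along the following lines. This is one possible sketch, not the only approach: it uses the built-in max() and min() functions with a key argument, and the variable names hottest and coldest are simply illustrative choices.

```python
# Dataset formatted as a list of tuples: (date, high, low)
data = [
    ("2023-11-01", 58, 32), ("2023-11-02", 64, 35), ("2023-11-03", 67, 44),
    ("2023-11-04", 67, 41), ("2023-11-05", 65, 45), ("2023-11-06", 71, 48),
    ("2023-11-07", 52, 38), ("2023-11-08", 49, 34), ("2023-11-09", 49, 31),
    ("2023-11-10", 53, 31), ("2023-11-11", 54, 31), ("2023-11-12", 63, 33),
    ("2023-11-13", 66, 42), ("2023-11-14", 66, 38), ("2023-11-15", 68, 44),
    ("2023-11-16", 59, 41), ("2023-11-17", 58, 36), ("2023-11-18", 52, 39),
    ("2023-11-19", 49, 35), ("2023-11-20", 46, 34), ("2023-11-21", 48, 30),
    ("2023-11-22", 49, 30), ("2023-11-23", 42, 32), ("2023-11-24", 36, 32),
    ("2023-11-25", 38, 29), ("2023-11-26", 36, 29), ("2023-11-27", 40, 25),
    ("2023-11-28", 41, 24), ("2023-11-29", 36, 25), ("2023-11-30", 34, 30)
]

# Task 2: the record with the highest high temperature (index 1)
hottest = max(data, key=lambda record: record[1])

# Task 3: the record with the lowest low temperature (index 2)
coldest = min(data, key=lambda record: record[2])

print(f"Highest temperature: {hottest[1]}°F on {hottest[0]}")
print(f"Lowest temperature: {coldest[2]}°F on {coldest[0]}")
```

If two days tie for the highest or lowest value, max() and min() return the first matching record in the list.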


Improving the Output of the Processed Dataset

Printing the processed_data list with no formatting, as the code above does, produces output that is hard for users to read. Since it is a list of tuples, how might we improve the output?

One approach is to iterate through the processed_data list and print one line per day showing the date and its average temperature, followed by a row of text characters that visually represents that temperature. Study the following code, which is a copy of the completed code above with the print(processed_data) call replaced by a loop that prints one formatted line for each day.

# Dataset formatted as a list of tuples
data = [
    ("2023-11-01", 58, 32), ("2023-11-02", 64, 35), ("2023-11-03", 67, 44),
    ("2023-11-04", 67, 41), ("2023-11-05", 65, 45), ("2023-11-06", 71, 48),
    ("2023-11-07", 52, 38), ("2023-11-08", 49, 34), ("2023-11-09", 49, 31),
    ("2023-11-10", 53, 31), ("2023-11-11", 54, 31), ("2023-11-12", 63, 33),
    ("2023-11-13", 66, 42), ("2023-11-14", 66, 38), ("2023-11-15", 68, 44),
    ("2023-11-16", 59, 41), ("2023-11-17", 58, 36), ("2023-11-18", 52, 39),
    ("2023-11-19", 49, 35), ("2023-11-20", 46, 34), ("2023-11-21", 48, 30),
    ("2023-11-22", 49, 30), ("2023-11-23", 42, 32), ("2023-11-24", 36, 32),
    ("2023-11-25", 38, 29), ("2023-11-26", 36, 29), ("2023-11-27", 40, 25),
    ("2023-11-28", 41, 24), ("2023-11-29", 36, 25), ("2023-11-30", 34, 30)
]

# Declare variables for use in processing
processed_data = []
total_temp = 0
count = 0

# Iterate through the dataset to calculate values
for date, hi, lo in data:
    avg_temp = (hi + lo) / 2
    processed_data.append((date, hi, lo, avg_temp))
    total_temp += avg_temp
    count += 1

# Calculate the average temperature for the entire month
monthly_avg_temp = total_temp / count

# Output results

# Print each day with its average temperature and a textual graph representing the temperature
for date, _, _, avg_temp in processed_data:
    hash_graph = '#' * int(avg_temp)
    print(f"{date}: {avg_temp}°F - {hash_graph}")

# Print the average temperature for the month
print("Average Temperature during November 2023: ", "{:.2f}".format(monthly_avg_temp))

Output

2023-11-01: 45.0°F - #############################################
2023-11-02: 49.5°F - #################################################
2023-11-03: 55.5°F - #######################################################
2023-11-04: 54.0°F - ######################################################
2023-11-05: 55.0°F - #######################################################
2023-11-06: 59.5°F - ###########################################################
2023-11-07: 45.0°F - #############################################
2023-11-08: 41.5°F - #########################################
2023-11-09: 40.0°F - ########################################
2023-11-10: 42.0°F - ##########################################
2023-11-11: 42.5°F - ##########################################
2023-11-12: 48.0°F - ################################################
2023-11-13: 54.0°F - ######################################################
2023-11-14: 52.0°F - ####################################################
2023-11-15: 56.0°F - ########################################################
2023-11-16: 50.0°F - ##################################################
2023-11-17: 47.0°F - ###############################################
2023-11-18: 45.5°F - #############################################
2023-11-19: 42.0°F - ##########################################
2023-11-20: 40.0°F - ########################################
2023-11-21: 39.0°F - #######################################
2023-11-22: 39.5°F - #######################################
2023-11-23: 37.0°F - #####################################
2023-11-24: 34.0°F - ##################################
2023-11-25: 33.5°F - #################################
2023-11-26: 32.5°F - ################################
2023-11-27: 32.5°F - ################################
2023-11-28: 32.5°F - ################################
2023-11-29: 30.5°F - ##############################
2023-11-30: 32.0°F - ################################

Average Temperature during November 2023:  43.57




Note: While the above textual graph is an improvement over simply printing the Python list of tuples, data visualizations are more commonly produced in a graphical manner (colors, shapes, images, etc.). We will use this example dataset to create a more visual version of this data when we look at Data Visualization later in this chapter.

Publicly Available Datasets

Many organizations and websites make datasets available to the public. Here is a short list of a few of them to get you started exploring different data types available in datasets.

Public datasets are excellent resources for learning, research, and analysis. Most of the datasets available at the above locations are free to use for educational and non-commercial use. Some of these public datasets are small, others are huge. I recommend that you visit some of these sites and explore the types of data available to the public.



 





© 2023 John Gordon
Cascade Street Publishing, LLC