Python Across Disciplines
by John Gordon © 2023


Parsing Data


Overview

In Chapter 2, I introduced numerous concepts about data and ways in which we might acquire data to process in our programs. String data in the form of characters, words, phrases, sentences, paragraphs, etc. is very common in computing and often needs to be processed in many different ways. Examples of processing string data include counting characters or words, locating instances of particular words or phrases, extracting key words, names, places, and dates (often called entities), validating important information in textual data, and many others. On this page we will explore parsing strings of several types.

What is Parsing?

While we know how to read the entire contents of a file into a variable and print it, we often need to examine individual parts of that content rather than treat it as one block. In order to examine long strings more closely, we use a technique called parsing.

Concept: Parsing

Parsing strings in Python involves analyzing a string's structure and extracting specific data from it according to a predefined pattern or structure. This process is essential in various applications, such as data analysis, web scraping, and configuration file management. Python offers multiple methods and libraries to facilitate string parsing, including built-in functions like split() for dividing a string into a list of substrings (called tokens) based on a delimiter, and strip() for trimming whitespace. For more complex parsing needs, Python provides the re module, which supports regular expressions allowing for sophisticated searching, matching, and manipulation of string patterns. This capability enables developers to extract specific pieces of information from strings, validate string formats, or transform strings in powerful ways, adapting to the diverse needs of different programming scenarios.
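
As a brief illustration of the three tools named above, here is a minimal sketch. The sample string and its field names are hypothetical and chosen only for demonstration; the pattern passed to re.findall() is just one simple way to match a four-digit year.

```python
import re

# Hypothetical sample data; the field names are illustrative only.
record = "  name=Ada Lovelace; born=1815; field=mathematics  "

# strip() trims the leading and trailing whitespace.
clean = record.strip()

# split() divides the string into tokens at each "; " delimiter.
fields = clean.split("; ")
print(fields)

# re.findall() returns every substring that matches a pattern;
# here, any standalone run of exactly four digits.
years = re.findall(r"\b\d{4}\b", clean)
print(years)
```

The built-in string methods are usually enough when the delimiter is fixed and known; regular expressions become useful when the pieces we want are described by a pattern rather than by a single separator.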

Here is a visual representation of parsing a string for the purposes of counting the number of words in the string:



Concept: Token

Tokens in the context of string data refer to the individual components or pieces that result from dividing a string based on specific criteria, such as delimiters. Tokenization is the process of splitting a string into these smaller parts or tokens, which can be words, numbers, or symbols, depending on the content of the string and the rules applied for splitting. This concept is fundamental in text processing, parsing, and natural language processing (NLP) tasks, where analyzing and understanding the textual data at a granular level is essential. For example, when processing a sentence, tokenization might involve splitting the sentence into individual words or punctuation marks. Python's split() method on strings is a straightforward way to tokenize a string based on whitespace or other specified delimiters. Additionally, libraries like NLTK (Natural Language Toolkit) provide more sophisticated tools for tokenization, capable of handling complex patterns, such as separating contractions or distinguishing punctuation. Tokenization is a crucial first step in preparing text for further analysis, such as counting word frequencies, performing sentiment analysis, or building machine learning models for text classification.



Dealing with Punctuation

One of the first decisions we need to make when parsing string data is what to do about punctuation. In some cases, we'll want to preserve punctuation during parsing, but most commonly we'll remove punctuation so that our parsed tokens are independent of punctuation that existed in the original string. There are a number of approaches we can use to remove punctuation from a string. For now, we'll use repetition to do this, and then later we'll learn more efficient approaches.

Let's use the string "To be or not to be, that is the question." from above as an example. In this string, there are two punctuation characters, the comma and the ending period. We could simply use the string replace() method to remove the two punctuation characters, like this:

s = "To be or not to be, that is the question."
print(s)
s = s.replace(",","")
print(s)
s = s.replace(".","")
print(s)

Output

To be or not to be, that is the question.
To be or not to be that is the question.
To be or not to be that is the question

While this works, it's really restricted to the one string example we have right now. What about other strings with different punctuation in them?

A more generic approach is to build a list of the punctuation characters we want to eliminate, and then use repetition (a loop) and comparison to copy only the non-punctuation characters of any string into a new string. Here's an example:

shake_string = "To be or not to be, that is the question."
out_string = ""
punctuation = ["!", "\"", "#", "$", "%", "&", "\'",
               "(", ")", "*", "+", ",", "-", ".",
               "/", ":", ";", "<", "=", ">", "?",
               "@", "[", "\\", "]", "^", "_", "{",
               "|", "}", "~", "`"]
for s in shake_string:
  if s not in punctuation:
    out_string += s
print(shake_string)
print(out_string)

Output

To be or not to be, that is the question.
To be or not to be that is the question
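
As a variation on the loop above, here is a minimal sketch that relies on the standard library's string.punctuation constant instead of a hand-typed list. This is one possible alternative, not necessarily the approach the book introduces later. The second half uses str.translate() with a table built by str.maketrans(), which removes the same characters without an explicit loop.

```python
import string

shake_string = "To be or not to be, that is the question."

# Approach 1: the same loop as above, but string.punctuation
# supplies the ASCII punctuation characters for us.
out_string = ""
for ch in shake_string:
    if ch not in string.punctuation:
        out_string += ch
print(out_string)

# Approach 2: build a translation table whose third argument lists
# the characters to delete, then apply it in a single call.
table = str.maketrans("", "", string.punctuation)
translated = shake_string.translate(table)
print(translated)
```

Both approaches print the same result for this string. Note that string.punctuation covers ASCII punctuation only, so characters such as curly quotes would need to be handled separately.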


Resolving Case Issues

Depending on the reason we are parsing strings, in addition to dealing with punctuation, we may also need to consider whether to maintain the case of the objects in the original string or to make them all the same. In some circumstances, like the need to preserve capitalization of nouns, it is important to maintain the original case. In those circumstances, we would not need to do anything to the string after removing punctuation. However, in other circumstances, like the need to easily compare the objects in the original string with other objects, it may be best to set the entire string to the same case (upper or lower).

For our purposes here, we will set the case to all lower case as we will be comparing objects from our original string with search words and it will be simpler to do so if the cases are all the same. The following code is identical to the code in the punctuation removal section above, other than the addition of the lower() method to the variable s on Code Line 10.

shake_string = "To be or not to be, that is the question."
out_string = ""
punctuation = ["!", "\"", "#", "$", "%", "&", "\'",
               "(", ")", "*", "+", ",", "-", ".",
               "/", ":", ";", "<", "=", ">", "?",
               "@", "[", "\\", "]", "^", "_", "{",
               "|", "}", "~", "`"]
for s in shake_string:
  if s not in punctuation:
    out_string += s.lower()    # << This is the change to lower case the output string.
print(shake_string)
print(out_string)

The decision to change case or not is often indicated in data processing requirements or becomes obvious as we develop our solutions.

Parsing Strings

Now that we have learned the fundamentals of dealing with punctuation, we'll learn to parse strings. The primary string method we use to parse strings is the split() method. This method divides a string into tokens (substrings) based on a specified delimiter (separator) and stores the tokens in a list. By default, the delimiter is any run of whitespace (spaces, newlines, tabs, etc.); otherwise, split() divides the string using the separator that we specify. The result of using the split() method is a Python list containing the parsed parts (tokens) of the original string.

Concept: Delimiter

A delimiter is a character or sequence of characters used to specify the boundary between separate, independent regions in text or data streams. Delimiters play a crucial role in string data processing, particularly in parsing, where they help to split a string into components or tokens based on specific boundaries. Common examples of delimiters include commas (,), semicolons (;), spaces ( ), and newlines (\n). Python provides various methods to work with delimited string data, such as the split() method of string objects, which splits a string into a list of tokens based on a specified delimiter. For instance, using a comma as a delimiter, a CSV (Comma-Separated Values) string can be parsed into individual components. The choice of delimiter is crucial for accurately interpreting the intended structure of the data, and it often depends on the format and specifications of the input data being processed.

Here are a few examples of using split() with the default delimiter and using a specified delimiter:

Example 1

# Using split() with the default delimiter (that is, none specified)
txt = "This is a string."
lst = txt.split()
print(lst)

Output

['This', 'is', 'a', 'string.']


Example 2

If we use the same code again, but this time include a specific delimiter, such as the letter s, we can see different results:

# Using split() and specifying a delimiter
txt = "This is a string."
lst = txt.split('s')
print(lst)

Output

['Thi', ' i', ' a ', 'tring.']


Example 3

Another (more common) example of using split() with a delimiter is comma-separated data where there are words or strings separated by commas, like this:

# Using split() and specifying comma as the delimiter
txt = "Bob Smith, 1234 Nowhere Lane, Someplace, UT, 84999"
lst = txt.split(',')
print(lst)

Output

['Bob Smith', ' 1234 Nowhere Lane', ' Someplace', ' UT', ' 84999']
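
Notice that every token after the first begins with a space, because split(',') removes only the comma itself. A minimal sketch of one way to clean this up, using a loop and the strip() method (the clean_lst name is introduced here for illustration only):

```python
txt = "Bob Smith, 1234 Nowhere Lane, Someplace, UT, 84999"

# Split on commas, then strip the surrounding whitespace
# from each token before storing it in a new list.
clean_lst = []
for tok in txt.split(','):
    clean_lst.append(tok.strip())
print(clean_lst)
```

Splitting on ', ' (comma plus space) would also work for this particular string, but stripping each token is more forgiving when the spacing after the commas is inconsistent.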


Use Cases

Once we have parsed a string, and have a list of the tokens, we can use those tokens for various purposes. Here are some examples:

Use Case: Counting Words

A common task to perform with a parsed string is to count the number of words in the string. As with most things in programming, there's more than one way to count words in a string. Since we are focused on repetition structures in this chapter, we'll look at an approach using repetition for now, and will learn more approaches as we proceed. Since the tokens of a string are stored in a list as a result of using the split() method, counting the number of words in a string is actually counting tokens in a list. Given this, we can simply iterate through the list using the for in construct and use an accumulator variable to count the number of iterations through the list.

Here's an example:

# For the sake of simplicity and focus, we will assume that punctuation
# has already been removed from the string as described above.
txt = "to be or not to be that is the question"
accumulator = 0
lst = txt.split()
for tok in lst:
    accumulator += 1
print("Number of words (tokens) in the string:", accumulator)

Output

Number of words (tokens) in the string: 10


Alternative Approach Using a Python Function

Because the split() method stores the string's tokens in a list, there is a much simpler approach of counting the number of tokens in the list without the need for looping at all, that is, using the len() function. Here's an example:

txt = "To be or not to be that is the question"
lst = txt.split()
print("Number of words:", len(lst))

Note that in this alternative code, there is no need for the accumulator or the loop; in the print statement we can simply use the len() function instead. Since we are focused on repetition, however, we will use repetition in the following use cases as well.


Use Case: Counting Specific Words

In this use case, we want to count the number of times a particular word (token) is in a string. The code to do this is very similar to the code we wrote in the previous example use case for counting words. The key difference is that in the code block inside of the loop, we'll need a conditional decision to determine if each word in the list matches the specified search word. When it matches, we'll increment our accumulator. When the token does not match, then we'll skip to the next token. Here's the code to accomplish this:

txt = "To be or not to be that is the question"
search_word = "be"
accumulator = 0
lst = txt.split()
for tok in lst:
  if tok == search_word:
    accumulator += 1
print("Number of occurrences of the word '" + search_word + "' is", accumulator, end=".")

Output

Number of occurrences of the word 'be' is 2.
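
The comparison in the loop above is case-sensitive, which ties back to the case discussion earlier on this page. The following minimal sketch searches for "to" and keeps two counters (exact_count and folded_count are names introduced here for illustration) to show how lowercasing each token changes the result:

```python
txt = "To be or not to be that is the question"
search_word = "to"
exact_count = 0     # counts case-sensitive matches only
folded_count = 0    # counts matches after lowercasing both sides
for tok in txt.split():
    if tok == search_word:
        exact_count += 1
    if tok.lower() == search_word.lower():
        folded_count += 1
print("Case-sensitive matches:", exact_count)
print("Case-insensitive matches:", folded_count)
```

The case-sensitive count misses the capitalized "To" at the start of the string, which is why lowercasing before comparing is often the safer choice when counting words.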



Use Case: Reversing a String by Word

Another use case that will add to your understanding of how to process strings that have been split() is reversing the string by word. Reversing a string usually means reversing all of its characters, so that the last character comes first and the first character comes last. For example, if we start with the string to be or not to be and reverse it, it would turn out to be eb ot ton ro eb ot. That reverses the order of all characters. Here, however, we want to reverse the order of the words in the string so that it reads be to not or be to instead. Since the split() method loads a list with the parsed tokens of the original string, there are a couple of ways in which we can reverse the word order.

Since we are focused on repetition in this chapter, we'll use a loop to reverse the order of the tokens in a list. Here's an example:

txt = "to be or not to be that is the question"
lst = txt.split()
rev_lst = []
for tok in lst:
    rev_lst.insert(0, tok)
print("Original List: ", lst)
print("Reversed List: ", rev_lst)

Output

Original List:  ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
Reversed List:  ['question', 'the', 'is', 'that', 'be', 'to', 'not', 'or', 'be', 'to']


Alternative Approach Using a List Method

In Python, the list data structure includes a method for reversing the list. Here's an example:

txt = "To be or not to be that is the question"
lst = txt.split()    # Parse the string into a list of tokens first
lst.reverse()
print("Reversed List:", lst)

Note that in this alternative code, there is no need for a loop; we can simply use the list reverse() method. One detail may be important depending on the application: the reverse() method reverses the original list in place. So, if the original order needs to be preserved, using reverse() on the original list may not be the best choice.
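
When the original order must be preserved, one option is to build a reversed copy instead. The minimal sketch below uses slicing with a step of -1 to create the copy, then the string join() method to turn the reversed tokens back into a single string. The rev_txt name is introduced here for illustration.

```python
txt = "to be or not to be that is the question"
lst = txt.split()

# Slicing with a step of -1 returns a new, reversed list
# and leaves the original list unchanged.
rev_lst = lst[::-1]

# join() glues the tokens back together with a space between them.
rev_txt = " ".join(rev_lst)

print("Original List:", lst)
print("Reversed Text:", rev_txt)
```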

Using Repetition & Parsing with Data Files

As introduced in Chapter 2, data files are common resources we use to store and transmit data. There are many types of files we could engage with as programmers. For now, we will work with plain text files (files with a file extension of .txt) and comma-separated values files (files with a file extension of .csv). These two types of files are very common and learning to work with them programmatically is an important skill.

TXT Files

Text files (often referred to as plain text files) can contain any type of alphanumeric data in various forms. In Chapter 2 we saw that it is easy to create, read, and write the contents of a text file. Using the example found there as a guide, let's create a file, write a sentence to it, close the file, reopen it, read its contents, and then print the contents:

# Open a file to store some text. Since this file
# does not exist, it will be created by the open
# function in the following code:
with open('example.txt', 'w') as file:
    file.write("This is a sentence.\n")    # < Note the use of the \n escape sequence

# Reopen the file and read it
with open('example.txt', 'r') as file:
    contents = file.read()
    print(contents)

Output:

This is a sentence.

Now that the text file exists, let's reopen it and add more lines of text, like this:

# Append more text
with open('example.txt', 'a') as file:
    file.write("This is another sentence.\n")    # < Note the use of the \n escape sequence
    file.write("And another.\n")     # < in the data written to the file
    file.write("We can append as many lines of text as we want.\n")
    # We could have done all of these in one write() call since the
    # escape sequence pushes each piece of text to its own line, like this:
    file.write("More text\n... and more ...\nok, ok, that's enough.\n")
    # No explicit close() is needed; the with statement closes the file.

# Reopen the file and read it
with open('example.txt', 'r') as file:
    contents = file.read()
    print(contents)

Output:

This is a sentence.
This is another sentence.
And another.
We can append as many lines of text as we want.
More text
... and more ...
ok, ok, that's enough.

In the code above we use the print(contents) statement after we've read the file contents. This works fine for our purposes here where we just want to print the entire contents of the file. However, in other instances we may want to work with the file contents one line at a time instead. To do this we can use repetition and the readline() method of the file object.

Here's a code example of reading each line of the example.txt file we created above.

line_count = 0
my_file = open("example.txt", "r")
line = my_file.readline()
while line:
    line_count += 1
    print("Line " + str(line_count) + ": Length: " + str(len(line)) + "\t" + line, end="")
    line = my_file.readline()
my_file.close()

Output:

Line 1: Length: 20      This is a sentence.
Line 2: Length: 26      This is another sentence.
Line 3: Length: 13      And another.
Line 4: Length: 48      We can append as many lines of text as we want.
Line 5: Length: 10      More text
Line 6: Length: 17      ... and more ...
Line 7: Length: 23      ok, ok, that's enough.
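
A file object can also be used directly in a for loop, which hands us one line per iteration without calling readline() ourselves. The minimal sketch below writes a small two-line sample file into a temporary folder so that it can run on its own; the temporary path and the lengths list are assumptions added for illustration and are not part of the example above.

```python
import os
import tempfile

# Create a small sample file in a temporary folder so this
# sketch is self-contained (the path is illustrative only).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as file:
    file.write("This is a sentence.\nAnd another.\n")

line_count = 0
lengths = []
with open(path, "r") as file:
    # Iterating over the file object yields one line at a time,
    # including each line's trailing newline character.
    for line in file:
        line_count += 1
        lengths.append(len(line))
        print("Line " + str(line_count) + ": Length: " + str(len(line)) + "\t" + line, end="")
```

The reported lengths include the newline character at the end of each line, just as they do in the readline() version.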


Another example of why we might read a file one line at a time is that we might want to store each line in a data structure, such as a list, for further processing.

Here's a code example of reading each line of the example.txt file we created above and storing each one in a list.

lst = []
my_file = open("example.txt", "r")
line = my_file.readline()
while line:
    lst.append(line)
    line = my_file.readline()
my_file.close()
print(lst)

Output:

['This is a sentence.\n', 'This is another sentence.\n', 'And another.\n', 'We can append as many lines of text as we want.\n', 'More text\n', '... and more ...\n', "ok, ok, that's enough.\n"]
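
Each string in the list still ends with the \n escape sequence that was written to the file. If the newline characters would get in the way of later processing, one option is to remove them with the rstrip() method. This minimal sketch starts from a short hard-coded list (an assumption made so the example runs without a file):

```python
# A short stand-in for the list of lines read from a file.
lines = ['This is a sentence.\n', 'And another.\n']

clean_lines = []
for line in lines:
    # rstrip("\n") removes only trailing newline characters.
    clean_lines.append(line.rstrip("\n"))
print(clean_lines)
```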


In the previous examples we wrote individual lines to a text file, so we treated each line (sentence) individually. However, text files often contain paragraphs, that is, more than one sentence with no line breaks (\n). When we read text files containing paragraphs, we'll often read the entire contents and then use the split() method to parse paragraphs into sentences as needed.

In the next code example, we'll write a paragraph to a text file and then read it and print its contents. The text used here is generic placeholder text from a Lorem Ipsum generator, which can be a very useful tool to practice processing text in Python.

txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."""

with open('example.txt', 'w') as file:
    file.write(txt)

with open('example.txt', 'r') as file:
    contents = file.read()
    print(contents)

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. 


It is important to remember that the paragraph above is one long string. So when we store the contents of the file in the contents variable (the contents = file.read() line above), it contains the entire paragraph. We can parse it into individual sentences if we need to using the split() method. Here's an example:

txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."""

with open('example.txt', 'w') as file:
    file.write(txt)

with open('example.txt', 'r') as file:
    contents = file.read()
    print(contents)

# Now let's parse the paragraph stored in the contents variable
# into individual sentences using the split() method, store
# the sentences in a list, and print the list
sentences = contents.split(". ")
# Confirm the data type of the sentences variable ...
print()
print(type(sentences))
print()
# Print the entire list...
print(sentences)
print()
# Now print the sentences one per line ...
for s in sentences:
  print(s)

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

<class 'list'>

['Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua', 'Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat', 'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur', 'Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.']

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
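
Because the delimiter ". " includes a space, the last sentence of a paragraph keeps its final period when no space follows it. A minimal sketch of one way to tidy the tokens, using a short hypothetical paragraph rather than the Lorem Ipsum text:

```python
# A short hypothetical paragraph used in place of the file contents.
paragraph = "First sentence. Second one here. And a third."

sentences = []
for s in paragraph.split(". "):
    # Remove surrounding whitespace, then any trailing period.
    s = s.strip().rstrip(".")
    # Skip empty tokens, which appear when the text ends with ". ".
    if s:
        sentences.append(s)
print(sentences)
```

This simple approach assumes every sentence ends with a period; text containing question marks, exclamation points, or abbreviations such as "Dr." would need a more careful strategy.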




Introduction to Text Analysis

When we're working with text files, we often need to programmatically analyze the contents of a file as an act of discovery about those contents. Text Analysis is a speciality within programming and can be very sophisticated. There are some fundamental tasks within the subfield we can explore here as part of our exploration of repetition and parsing.

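We'll use an example to learn approaches to completing the following tasks on a text file: counting the total number of words, counting the number of unique words, finding the longest word, and listing each unique word along with its number of occurrences.

Note 1: If you would like to follow along with the provided Solution to this Practice Problem, you will need the text file I use for this code example (DeclarationOfIndependence.txt).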

Note 2: Study the code comments carefully for information about the code segments.

Code

print("-" * 80)
print("Simple Analysis of File: DeclarationOfIndependence.txt")
print("-" * 80)
# Create list of punctuation characters we'll use to remove punctuation from the content
punctuation = ['~', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '`',
            '-', '=', '{', '}', '|', '[', ']', '\\', ':', '"', ';', '<', '>', '?',
            ',', '.', '/']
# Connect to our text file
file_contents = open(r"C:\Users\John\Documents\declarationofindependence.txt", "r")
# Read file contents into string variable
declaration = file_contents.read()
# Set everything to lower case
declaration = declaration.lower()
# Remove punctuation from string
for i in punctuation:
    declaration = declaration.replace(i, ' ')
# Split string into individual words into list
words = declaration.split()
# The number of words in the file is the length of the words list
print("Number of words in the file: " + str(len(words)))
# Gather list of unique words in the content
# Create an empty list to store unique words
unique_words = []
# Loop through the words list and add each word to
# the unique_words list just once
for word in words:
    if word not in unique_words:
        unique_words.append(word)
print("Number of unique words in the file: " + str(len(unique_words)))
# Sort the list
unique_words.sort()
# Set counter variables we will use to identify the list index,
# the longest word length and index
list_index = 0
longest_word_length = 0
longest_word_index = 0
for w in unique_words:
    # Determine the longest word in the unique list
    if len(w) > longest_word_length:
        longest_word_index = list_index
        longest_word_length = len(w)
    # Replace the word string with the string plus the number of
    # occurrences in the words list in parenthesis
    unique_words[list_index] = w + " (" + str(words.count(w)) + ")"
    list_index += 1
print("-" * 80)
print("Longest Word: " + unique_words[longest_word_index] +
      " Length: " + str(longest_word_length))
print("-" * 80)
print("Unique Words and Number of Occurrences Each:")
print("-" * 120)
list_index = 0
line_count = 1
for w in unique_words:
    print(unique_words[list_index].ljust(20) + "  ", end="")
    if line_count % 5 == 0:
        print()
    list_index += 1
    line_count += 1
print()
print("-" * 120)

Output

--------------------------------------------------------------------------------
Simple Analysis of File: DeclarationOfIndependence.txt
--------------------------------------------------------------------------------
Number of words in the file: 1326
Number of unique words in the file: 534
--------------------------------------------------------------------------------
Longest Word: representatives (1) Length: 15
--------------------------------------------------------------------------------
Unique Words and Number of Occurrences Each:
------------------------------------------------------------------------------------------------------------------------
a (16)                abdicated (1)         abolish (1)           abolishing (3)        absolute (3)
absolved (1)          abuses (1)            accommodation (1)     accordingly (1)       accustomed (1)
acquiesce (1)         act (1)               acts (2)              administration (1)    affected (1)
after (1)             against (2)           ages (2)              all (10)              allegiance (1)
alliances (1)         alone (1)             already (1)           alter (2)             altering (1)
america (1)           among (5)             amongst (1)           amount (1)            an (5)
and (57)              annihilation (1)      another (1)           answered (1)          any (2)
appealed (1)          appealing (1)         appropriations (1)    arbitrary (1)         are (9)
armed (1)             armies (2)            arms (1)              as (4)                assembled (1)
assent (4)            assume (1)            at (4)                attempts (1)          attend (1)
attentions (1)        authority (1)         away (1)              bands (1)             barbarous (1)
be (9)                bear (1)              become (1)            becomes (2)           been (4)
begun (1)             benefits (1)          between (1)           beyond (1)            bodies (2)
boundaries (1)        brethren (2)          bring (1)             britain (2)           british (2)
burnt (1)             but (1)               by (13)               called (1)            candid (1)
captive (1)           cases (2)             cause (1)             causes (2)            certain (1)
changed (1)           character (1)         charters (1)          circumstances (2)     citizens (1)
civil (1)             civilized (1)         coasts (1)            colonies (4)          combined (1)
commerce (1)          commit (1)            common (1)            complete (1)          compliance (1)
conclude (1)          conditions (2)        congress (1)          conjured (1)          connected (1)
connection (1)        connections (1)       consanguinity (1)     consent (3)           constitution (1)
constrained (1)       constrains (1)        contract (1)          convulsions (1)       correspondence (1)
country (1)           course (1)            created (1)           creator (1)           crown (1)
cruelty (1)           cutting (1)           dangers (1)           deaf (1)              death (1)
decent (1)            declaration (2)       declare (2)           declaring (2)         define (1)
denounces (1)         dependent (1)         depository (1)        depriving (1)         deriving (1)
design (1)            desolation (1)        despotism (1)         destroyed (1)         destruction (1)
destructive (1)       dictate (1)           direct (1)            disavow (1)           disposed (1)
dissolutions (1)      dissolve (1)          dissolved (2)         distant (1)           districts (1)
divine (1)            do (3)                domestic (1)          duty (1)              each (1)
earth (1)             eat (1)               effect (1)            elected (1)           emigration (1)
encourage (1)         endeavored (2)        endowed (1)           ends (1)              enemies (1)
english (1)           enlarging (1)         entitle (1)           equal (2)             erected (1)
establish (1)         established (1)       establishing (2)      establishment (1)     events (1)
every (2)             evident (1)           evils (1)             evinces (1)           example (1)
excited (1)           executioners (1)      exercise (1)          experience (1)        exposed (1)
extend (1)            facts (1)             fall (1)              fatiguing (1)         fellow (1)
firm (1)              firmness (1)          fit (1)               for (29)              forbidden (1)
foreign (2)           foreigners (1)        form (2)              former (1)            formidable (1)
forms (2)             fortunes (1)          foundation (1)        free (4)              friends (2)
from (6)              frontiers (1)         full (1)              fundamentally (1)     future (1)
general (1)           giving (1)            god (1)               good (2)              governed (1)
government (6)        governments (3)       governors (1)         great (2)             guards (1)
hands (1)             happiness (2)         harass (1)            has (21)              have (11)
having (1)            he (19)               head (1)              here (2)              high (1)
his (9)               history (2)           hither (2)            hold (3)              honor (1)
houses (1)            human (1)             humble (1)            immediate (1)         impel (1)
importance (1)        imposing (1)          in (19)               incapable (1)         indeed (1)
independence (1)      independent (4)       indian (1)            inestimable (1)       inevitably (1)
inhabitants (2)       injuries (1)          injury (1)            institute (1)         instituted (1)
instrument (1)        insurrections (1)     intentions (1)        interrupt (1)         into (2)
introducing (1)       invariably (1)        invasion (1)          invasions (1)         invested (1)
is (10)               it (6)                its (3)               judge (1)             judges (1)
judiciary (1)         jurisdiction (2)      jury (1)              just (1)              justice (3)
kept (1)              kindred (1)           king (1)              known (1)             lands (1)
large (4)             laws (9)              laying (1)            legislate (1)         legislation (1)
legislative (2)       legislature (2)       legislatures (2)      let (1)               levy (1)
liberty (1)           life (1)              light (1)             likely (1)            lives (2)
long (3)              made (1)              magnanimity (1)       mankind (3)           manly (1)
many (1)              marked (1)            may (2)               meantime (1)          measures (1)
men (2)               mercenaries (1)       merciless (1)         migrations (1)        military (1)
mock (1)              more (1)              most (5)              multitude (1)         murders (1)
must (1)              mutually (1)          name (1)              nation (1)            native (1)
naturalization (1)    nature (1)            nature's (1)          necessary (2)         necessity (2)
neglected (1)         neighboring (1)       new (4)               nor (1)               not (1)
now (1)               object (2)            obstructed (1)        obstructing (1)       obtained (1)
of (77)               off (2)               offenses (1)          officers (1)          offices (2)
on (8)                once (1)              one (1)               only (2)              operation (1)
opinions (1)          opposing (1)          oppressions (1)       or (2)                organizing (1)
other (3)             others (3)            ought (2)             our (26)              out (2)
over (2)              own (1)               paralleled (1)        parts (1)             pass (3)
patient (1)           payment (1)           peace (3)             people (10)           perfidy (1)
petitioned (1)        petitions (1)         places (1)            pledge (1)            plundered (1)
political (2)         population (1)        power (3)             powers (5)            present (1)
pressing (1)          pretended (2)         prevent (1)           prince (1)            principles (1)
protecting (1)        protection (2)        prove (1)             provide (1)           providence (1)
province (1)          prudence (1)          public (2)            publish (1)           punishment (1)
purpose (2)           pursuing (1)          pursuit (1)           quartering (1)        raising (1)
ravaged (1)           records (1)           rectitude (1)         redress (1)           reduce (1)
refused (3)           refusing (2)          reliance (1)          relinquish (1)        remaining (1)
reminded (1)          render (2)            repeated (3)          repeatedly (1)        representation (1)
representative (1)    representatives (1)   requires (1)          respect (1)           rest (1)
returned (1)          right (7)             rights (3)            rule (2)              ruler (1)
sacred (1)            safety (1)            salaries (1)          same (2)              savages (1)
scarcely (1)          seas (3)              secure (1)            security (1)          seem (1)
self (1)              sent (1)              separate (1)          separation (2)        settlement (1)
sexes (1)             shall (1)             should (4)            shown (1)             so (2)
sole (1)              solemnly (1)          stage (1)             standing (1)          state (2)
states (7)            station (1)           subject (1)           submitted (1)         substance (1)
such (6)              suffer (1)            sufferable (1)        sufferance (1)        superior (1)
support (1)           supreme (1)           suspended (2)         suspending (1)        swarms (1)
system (1)            systems (1)           taken (1)             taking (1)            taxes (1)
tenure (1)            terms (1)             than (1)              that (13)             the (77)
their (20)            them (15)             themselves (3)        therefore (2)         therein (1)
these (13)            they (7)              things (1)            this (3)              those (1)
throw (1)             thus (1)              ties (1)              till (1)              time (4)
times (1)             to (65)               together (1)          too (1)               totally (2)
towns (1)             trade (1)             train (1)             transient (1)         transporting (2)
trial (2)             tried (1)             troops (1)            truths (1)            tyranny (2)
tyrant (1)            tyrants (1)           unacknowledged (1)    unalienable (1)       uncomfortable (1)
under (1)             undistinguished (1)   unfit (1)             united (2)            unless (2)
unusual (1)           unwarrantable (1)     unworthy (1)          us (11)               usurpations (3)
utterly (1)           valuable (1)          voice (1)             waging (1)            wanting (1)
war (3)               warfare (1)           warned (1)            we (11)               whatsoever (1)
when (3)              whenever (1)          whereby (1)           which (10)            while (1)
wholesome (1)         whose (2)             will (2)              with (9)              within (1)
without (3)           works (1)             world (3)             would (2)
------------------------------------------------------------------------------------------------------------------------


CSV Files

Another type of text file commonly used to store and transfer data is the Comma-Separated Values (CSV) file. CSV files are used to store and exchange data between different software applications. The content of a CSV file is plain text, like that of a text (TXT) file. The primary difference between a plain text file (above) and a CSV file is that plain text files often contain unstructured data, that is, words, phrases, paragraphs, etc. that are not in any standard format. The content of a CSV file, on the other hand, is in a tabular (rows and columns) format, with the values in each row separated by commas.

Figure 1 depicts the general format of a CSV file:



Figure 1: Structure of a CSV File Containing One or More Records


In Figure 1 we see that a CSV file is made up of one-to-many (n) rows and one-to-many (m) columns. Each row is called a record and each column is called a field. Each record is made up of fields that describe that record. For example, a customer record might be made up of a Customer ID number and that customer's first name, last name, address, phone number and email address. That record would contain six fields in the CSV file, one for each of those attributes of the customer.
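
To make the record and field vocabulary concrete, here is a minimal sketch that pairs field names with the values from one record. The field names and the sample line are assumptions made for illustration; they are not taken from a real file.

```python
# Hypothetical field names (columns) for a customer record
field_names = ["CustomerID", "FirstName", "LastName", "Phone"]

# One record (row) as it might appear on a single line of a CSV file
line = "123456,Daffy,Duck,2223334444"

# Splitting on commas yields one value per field
values = line.split(",")

# Pair each field name with its value
record = dict(zip(field_names, values))

for name, value in record.items():
    print(name.ljust(12), value)
```

Each value in the line lines up with one field name, which is exactly the row-and-column relationship shown in Figure 1.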

A CSV file containing customer records as described above might look like this:



Figure 2: Example CSV File

Notice that this example file (named Customers.csv) contains ten customer records, one per row (or line) in the file. This file also contains a header row of column titles that helps us discern what each column represents in the data. A header row in a CSV file is optional.
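
Because the header row is optional, a program has to know whether the first line holds column titles or data. The sketch below assumes the header is present and separates it from the records. It uses io.StringIO as a stand-in for a real file, and the sample data is illustrative only.

```python
import io

# Stand-in for a small CSV file with a header row (sample data is illustrative)
sample = io.StringIO("FirstName,LastName\nDaffy,Duck\nBugs,Bunny\n")

lines = sample.read().splitlines()

# The first line holds the column titles; the remaining lines are records
header = lines[0].split(",")
records = [line.split(",") for line in lines[1:]]

print("Columns:", header)
print("Number of records:", len(records))
```

If the file had no header row, the first line would have to be treated as a record instead.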

Creating a Simple CSV File with One Field

We can create a simple CSV file by opening a new file with a file name and the .csv file extension, like this:

# Open (create) a new file we'll use to store names
with open('Customers.csv', 'w') as file:
    file.write("Daffy\n")    # < Note the use of the \n escape sequence

# Reopen the file and read it
# (the with statement closes the file automatically, so no close() call is needed)
with open('Customers.csv', 'r') as file:
    contents = file.read()
    print(contents)

Output:

Daffy

This simple CSV file contains one record with one field and no header row. While this is very minimal, it is a valid CSV file. We could have included a header row by adding one additional write statement, like this:

# Open (create) a new file we'll use to store names
with open('Customers.csv', 'w') as file:
    file.write("FirstName\n")    # < Added this write statement to include a column header line.
    file.write("Daffy\n")

# Reopen the file and read it
# (the with statement closes the file automatically, so no close() call is needed)
with open('Customers.csv', 'r') as file:
    contents = file.read()
    print(contents)

Output:

FirstName
Daffy

Next, let's reopen it in append mode and add more names, like this:

# Append more names
with open('Customers.csv', 'a') as file:
    file.write("Marvin\n")
    file.write("Tazmanian\n")
    file.write("Bugs\n")
    file.write("Space\n")
    file.write("Yogi\n")
    # We could have done all of these in one write() line since the 
    # escape sequence will push each name to its own line, like this:
    file.write("Fred\nScooby\nMickey\nCharlie\n")

# Reopen the file and read it
# (the with statement closes the file automatically, so no close() call is needed)
with open('Customers.csv', 'r') as file:
    contents = file.read()
    print(contents)

Output:

FirstName
Daffy
Marvin
Tazmanian
Bugs
Space
Yogi
Fred
Scooby
Mickey
Charlie

Creating a CSV File with More Than One Field

Most of the time CSV files contain more than one field per record. Here is an example, expanding on the example above:

with open('Customers.csv', 'w') as file:
    file.write("FirstName,LastName,Address\n")    # < Write the header row to the file
    file.write("Daffy,Duck,123 Quackville Road\n")
    file.write("Marvin,Martian,234 Crater Lane\n")
    file.write("Tazmanian,Devil,345 Taz Street\n")
    file.write("Bugs,Bunny,456 Carrot Blvd.\n")
    file.write("Space,Ghost,999 Space Lane\n")
    file.write("Yogi,Bear,454 Bear Blvd.\n")
    file.write("Fred,Flintstone,825 Rock Street\n")
    file.write("Scooby,Doo,444 Snacks Street\n")
    file.write("Mickey,Mouse,356 Squeek Lane\n")
    file.write("Charlie,Brown,987 Snoopy Street\n")

# Reopen the file and read it
# (the with statement closes the file automatically, so no close() call is needed)
with open('Customers.csv', 'r') as file:
    contents = file.read()
    print(contents)

Output:

FirstName,LastName,Address
Daffy,Duck,123 Quackville Road
Marvin,Martian,234 Crater Lane
Tazmanian,Devil,345 Taz Street
Bugs,Bunny,456 Carrot Blvd.
Space,Ghost,999 Space Lane
Yogi,Bear,454 Bear Blvd.
Fred,Flintstone,825 Rock Street
Scooby,Doo,444 Snacks Street
Mickey,Mouse,356 Squeek Lane
Charlie,Brown,987 Snoopy Street

Now we have a CSV file containing multiple records, each with more than one field.
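
Typing the commas by hand works for a few records, but the same rows can also be built from lists of values. The following is a minimal sketch of that idea using the string join() method. The file name CustomersSketch.csv is an assumption chosen so that this sketch does not overwrite Customers.csv.

```python
# Each customer is a list of field values (sample data from the example above)
customers = [
    ["Daffy", "Duck", "123 Quackville Road"],
    ["Marvin", "Martian", "234 Crater Lane"],
]

# Hypothetical file name, used so Customers.csv is left untouched
with open("CustomersSketch.csv", "w") as file:
    file.write("FirstName,LastName,Address\n")
    for customer in customers:
        # join() places a comma between the fields; \n ends the record
        file.write(",".join(customer) + "\n")

# Reopen the file and read it
with open("CustomersSketch.csv", "r") as file:
    contents = file.read()
print(contents)
```

This simple approach assumes that no field value contains a comma itself. The csv library, covered in the "Using the CSV Library" section, handles that case for us.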


Problem: Create a CSV file from user data entry

Write a Python program that prompts the user for customer records and writes those records to a CSV file called Customers.csv. The CSV file should contain the following fields:

  • Customer ID
  • First Name
  • Last Name
  • Address
  • City
  • State
  • Zip Code
  • Phone Number
  • Email Address




Using Repetition to Read a CSV Line by Line

When we work with CSV files, it is often necessary to read the file one line at a time because each line in a CSV file is a record. Using our Customers.csv file as an example, we can use a loop to read each line of the file, that is, each customer record, one at a time. Inside the loop we can then process each record as needed.
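
Stripped to its essentials, the idea looks like the sketch below. It uses io.StringIO as a stand-in for an open file so that it runs on its own, and the three sample records are illustrative only.

```python
import io

# Stand-in for an open CSV file (io.StringIO behaves like a text file object)
my_file = io.StringIO("Daffy,Duck\nMarvin,Martian\nBugs,Bunny\n")

record_count = 0
for line in my_file:
    # rstrip() removes the trailing newline; split(',') separates the fields
    fields = line.rstrip().split(",")
    print(fields[0], fields[1])
    record_count += 1

print("Records read:", record_count)
```
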

The following example demonstrates a common pattern for working with files: we read the file one line at a time, then for each line we do something with the attributes in the record. In this case, we print each attribute separately so that we can give each column a proper width, and we slice and concatenate where needed (the phone number, for example). In addition, we use a counter to count the number of records in the file so that we can print that number as a summary in the footer of the report. Note that the version of Customers.csv read here has nine fields per record, stores each phone number as ten digits, and has no header row. Be sure to read through the Code Details under the sample output below.

# Global Variables
report_width = 120

# Functions
def print_header():
    report_title = "C u s t o m e r  R e p o r t"
    print("-" * report_width)
    print(" " * int((report_width / 2) - len(report_title) / 2), end="")
    print(report_title)
    print("-" * report_width)
    print("ID".ljust(8), end="")
    print("First".ljust(12), end="")
    print("Last".ljust(12), end="")
    print("Address".ljust(25), end="")
    print("City".ljust(15), end="")
    print("ST".ljust(5), end="")
    print("Zip".ljust(7), end="")
    print("Phone".ljust(17), end="")
    print("Email".ljust(30))
    print("-" * report_width)

def print_footer(counter):
    print("-" * report_width)
    print("Number of Customers: " + str(counter))
    print("-" * report_width)

# Main Program
customer_count = 0
print_header()
# The with statement closes the file automatically when the loop finishes
with open("Customers.csv", "r") as my_file:
    for line in my_file:
        # Remove the trailing newline, then split the record into its fields
        customer_record = line.rstrip().split(',')
        print(customer_record[0].ljust(8), end="")
        print(customer_record[1].ljust(12), end="")
        print(customer_record[2].ljust(12), end="")
        print(customer_record[3].ljust(25), end="")
        print(customer_record[4].ljust(15), end="")
        print(customer_record[5].ljust(5), end="")
        print(customer_record[6].ljust(7), end="")
        # Slice the ten-digit phone number into (AAA)PPP-NNNN
        print("(" + customer_record[7][0:3] + ")" +
              customer_record[7][3:6] + "-" +
              customer_record[7][6:].ljust(8), end="")
        print(customer_record[8].ljust(30), end="")
        print()
        customer_count += 1
print_footer(customer_count)

Output:

------------------------------------------------------------------------------------------------------------------------
                                    C u s t o m e r  R e p o r t
------------------------------------------------------------------------------------------------------------------------
ID      First       Last        Address                  City           ST   Zip    Phone            Email
------------------------------------------------------------------------------------------------------------------------
123456  Daffy       Duck        123 Quackville Road      Feathers       UT   84555  (222)333-4444    daff@quack.com
234567  Marvin      Martian     234 Crater Lane          Mars           UT   84777  (333)444-5555    marv@mars.org
345678  Tazmanian   Devil       345 Taz Street           Mania          UT   84222  (444)555-6666    taz@devdev.net
456789  Bugs        Bunny       456 Carrot Blvd.         Hopp           UT   84999  (555)777-2222    buggs@hoppy.com
567890  Space       Ghost       999 Space Lane           Orbit          UT   84333  (666)777-8888    ghostie@rocket.com
678901  Yogi        Bear        454 Bear Blvd.           Bearville      UT   84524  (777)888-9999    yogi@bear.net
789012  Fred        Flintstone  825 Rock Street          Bedrock        UT   84846  (888)999-0000    freddy@stone.com
890123  Scooby      Doo         444 Snacks Street        Doggo          UT   84000  (999)111-2222    scoob@snackers.com
901234  Mickey      Mouse       356 Squeek Lane          Cheese         UT   84567  (111)222-3333    mick@mouse.com
466577  Charlie     Brown       987 Snoopy Street        Chuck          UT   84575  (234)645-7737    chuck@peanuts.com
------------------------------------------------------------------------------------------------------------------------
Number of Customers: 10
------------------------------------------------------------------------------------------------------------------------
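
The phone number slicing in the report can be isolated in a small helper function, sketched below. The name format_phone is hypothetical, and the sketch assumes the phone number is stored as ten digits with no punctuation, as the slices in the report code imply.

```python
def format_phone(digits):
    """Format a ten-digit string as (AAA)PPP-NNNN using slices."""
    return "(" + digits[0:3] + ")" + digits[3:6] + "-" + digits[6:]

# ljust(17) pads the formatted number to the width of the Phone column
formatted = format_phone("2223334444")
print(formatted.ljust(17) + "|")
```

Padding the whole formatted string, rather than only its last piece, keeps the column width in one place.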

Code Details:



Using the CSV Library

In Python, we can use the csv library, which is part of the Standard Library, to work with CSV files. Full documentation for this library is available in the official Python documentation. Here is an example of using the csv library:

Code

import csv

with open('Customers.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['ID','FirstName', 'LastName', 'Address', 'City', 'State', 'Zip','Phone','Email'])
    writer.writerow(['123456','Daffy','Duck','123 Quackville Road','Feathers','UT','84555','(222)333-4444','daff@quack.com'])
    writer.writerow(['234567','Marvin','Martian','234 Crater Lane','Mars','UT','84777','(333)444-5555','marv@mars.org'])
    writer.writerow(['345678','Tazmanian','Devil','345 Taz Street','Mania','UT','84222','(444)555-6666','taz@devdev.net'])
    writer.writerow(['456789','Bugs','Bunny','456 Carrot Blvd.','Hopp','UT','84999','(555)777-2222','buggs@hoppy.com'])
    writer.writerow(['567890','Space','Ghost','999 Space Lane','Orbit','UT','84333','(666)777-8888','ghostie@rocket.com'])

with open('Customers.csv', 'a', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['678901','Yogi','Bear','454 Bear Blvd.','Bearville','UT','84524','(777)888-9999','yogi@bear.net'])
    writer.writerow(['789012','Fred','Flintstone','825 Rock Street','Bedrock','UT','84846','(888)999-0000','freddy@stone.com'])
    writer.writerow(['890123','Scooby','Doo','444 Snacks Street','Doggo','UT','84000','(999)111-2222','scoob@snackers.com'])
    writer.writerow(['901234','Mickey','Mouse','356 Squeek Lane','Cheese','UT','84567','(111)222-3333','mick@mouse.com'])
    writer.writerow(['466577','Charlie','Brown','987 Snoopy Street','Chuck','UT','84575','(234)645-7737','chuck@peanuts.com'])

with open('Customers.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print("Row: ", row)
    print()
    # After the loop ends, row still holds the last record read,
    # which we print here as an address block
    print(row[0], "\t" + row[1] + " " + row[2] + "\n\t" + row[3] + "\n\t" +
          row[4] + ", " + row[5] + " " + row[6] + "\n\t" + row[7] + "\n\t" +
          row[8] + "\n")

Output

Row:  ['ID', 'FirstName', 'LastName', 'Address', 'City', 'State', 'Zip', 'Phone', 'Email']
Row:  ['123456', 'Daffy', 'Duck', '123 Quackville Road', 'Feathers', 'UT', '84555', '(222)333-4444', 'daff@quack.com']
Row:  ['234567', 'Marvin', 'Martian', '234 Crater Lane', 'Mars', 'UT', '84777', '(333)444-5555', 'marv@mars.org']
Row:  ['345678', 'Tazmanian', 'Devil', '345 Taz Street', 'Mania', 'UT', '84222', '(444)555-6666', 'taz@devdev.net']
Row:  ['456789', 'Bugs', 'Bunny', '456 Carrot Blvd.', 'Hopp', 'UT', '84999', '(555)777-2222', 'buggs@hoppy.com']
Row:  ['567890', 'Space', 'Ghost', '999 Space Lane', 'Orbit', 'UT', '84333', '(666)777-8888', 'ghostie@rocket.com']
Row:  ['678901', 'Yogi', 'Bear', '454 Bear Blvd.', 'Bearville', 'UT', '84524', '(777)888-9999', 'yogi@bear.net']
Row:  ['789012', 'Fred', 'Flintstone', '825 Rock Street', 'Bedrock', 'UT', '84846', '(888)999-0000', 'freddy@stone.com']
Row:  ['890123', 'Scooby', 'Doo', '444 Snacks Street', 'Doggo', 'UT', '84000', '(999)111-2222', 'scoob@snackers.com']
Row:  ['901234', 'Mickey', 'Mouse', '356 Squeek Lane', 'Cheese', 'UT', '84567', '(111)222-3333', 'mick@mouse.com']
Row:  ['466577', 'Charlie', 'Brown', '987 Snoopy Street', 'Chuck', 'UT', '84575', '(234)645-7737', 'chuck@peanuts.com']

466577  Charlie Brown
        987 Snoopy Street
        Chuck, UT 84575
        (234)645-7737
        chuck@peanuts.com
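
When a CSV file has a header row, the csv library's DictReader class lets us refer to fields by column name instead of by position. The sketch below is one possible way to use it. It reads from io.StringIO as a stand-in for Customers.csv, with a shortened set of columns and two sample records chosen for illustration.

```python
import csv
import io

# Stand-in for Customers.csv with a header row (shortened for this sketch)
sample = io.StringIO(
    "ID,FirstName,LastName,City\n"
    "123456,Daffy,Duck,Feathers\n"
    "234567,Marvin,Martian,Mars\n"
)

# DictReader uses the first row as the field names for every later row
reader = csv.DictReader(sample)
names = []
for row in reader:
    names.append(row["FirstName"] + " " + row["LastName"])
    print(row["ID"], row["FirstName"], row["LastName"], row["City"])
```

Because the header row is consumed as field names, it is not returned as a data row, so the loop sees only the two customer records.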

Code Details



 





© 2023 John Gordon
Cascade Street Publishing, LLC