Processing many files in Python, plotting and saving to a new/existing folder
During my research, I needed to import, clean, add new columns to, plot and smooth the data in each of 833 text files. Doing this in analysis software, or one file at a time in Python, would have taken forever. In this article, I’ll show you how I processed all the data in less than 2 minutes by writing a function that loops through all the files, creates a data frame for each one, adds new columns containing my calculations and plots the desired columns to a standard that is acceptable to journals and publishers.
Now, let’s discuss the code:
1. Import the libraries and modules required for the work
import os
import pandas as pd
import matplotlib as mpl
from matplotlib import pyplot as plt
from scipy.signal import savgol_filter
import os
lets us create and remove folders (directories), list their contents, get or change the current directory, and more.
import pandas as pd
lets us handle and manipulate data. We import it as pd so that we can write pd instead of pandas each time we need it.
from matplotlib import pyplot as plt
gives us matplotlib's pyplot interface, a set of MATLAB-like functions for 2D plotting.
from scipy.signal import savgol_filter
imports the Savitzky-Golay filter, which smooths noisy curves. We specify the window length to smooth over and the order of the polynomial fitted within each window.
2. Write a function that takes the directory path, loops through every file of a specified format in that folder and creates a dataframe for each one
def process_directory(directory: str):
    '''Loop files in the specified directory'''
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            file_directory = os.path.join(directory, filename)
            print(file_directory)
            column_names = ['Wavelength', 'Intensity']
            df = pd.read_csv(file_directory, delimiter="\t", names=column_names, header=None)
def process_directory(directory: str):
defines a function that can be called by the name process_directory. The path to the folder of interest is passed inside the brackets as a string.
for filename in os.listdir(directory):
os.listdir returns the names of all the entries in the specified folder, and the loop assigns one name to filename on each pass.
if filename.endswith(".txt"):
checks whether the filename ends in .txt, that is, whether it is a text file.
file_directory = os.path.join(directory, filename)
combines the folder path and the filename to get the full path to the file being processed in that pass of the loop.
print(file_directory)
prints the path of each file as it is processed. With many files, you may want to leave this out so that the output area is not congested.
column_names = ['Wavelength', 'Intensity']
names the columns in the dataset. You can leave this out if the first line of your data contains the column names. The first line of my data is part of the data, so I had to supply the column names myself.
df = pd.read_csv(file_directory, delimiter='\t', names=column_names, header=None)
creates a dataframe, using pandas, for the current file being processed.
delimiter='\t'
sets the delimiter of the data to a tab. It can be omitted if the file is comma-separated, since ',' is the default, and other separators can be given in the same way.
names=column_names
uses the column names defined in the previous line as the header.
header=None
tells pandas that the file has no header row, so the first line is read as data instead of being used as the column names.
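As a quick check of the looping and reading step, here is a minimal sketch that builds a temporary folder with two made-up tab-separated files and one non-text file, then reads only the .txt files. The helper name read_spectra, the file names and the numbers are assumptions for illustration and are not taken from my data.

```python
import os
import tempfile

import pandas as pd


def read_spectra(directory: str) -> dict:
    """Return {filename: DataFrame} for every .txt file in `directory`.

    Hypothetical helper for illustration; it only covers the reading step.
    """
    column_names = ["Wavelength", "Intensity"]
    frames = {}
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".txt"):
            file_path = os.path.join(directory, filename)
            # header=None: the first line is data, not column names
            frames[filename] = pd.read_csv(
                file_path, delimiter="\t", names=column_names, header=None
            )
    return frames


# Build a throwaway folder with two made-up spectra and one file to skip.
with tempfile.TemporaryDirectory() as folder:
    for name in ("sample_a.txt", "sample_b.txt"):
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("400.0\t12.5\n401.0\t13.1\n")
    with open(os.path.join(folder, "readme.md"), "w") as handle:
        handle.write("not a spectrum\n")

    frames = read_spectra(folder)

print(list(frames))            # only the .txt files are read
print(frames["sample_a.txt"])  # two rows, columns Wavelength and Intensity
```

Sorting the output of os.listdir is a choice made here so that the files are processed in a predictable order. The function in this article does not sort them.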
3. Data processing using pandas
df['Energy'] = ((4.14e-15)*(3e17))/df.Wavelength
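We created a new column named Energy, which holds the photon energy in eV calculated from the wavelength as E = hc/λ. Here 4.14e-15 is Planck's constant in eV·s and 3e17 is the speed of light in nm/s, so the formula assumes the wavelength is in nanometres.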
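To see the conversion on its own, here is a small sketch that applies the same formula to a three-row dataframe. The constant names and the sample wavelengths are made up for illustration, and the wavelengths are assumed to be in nm.

```python
import pandas as pd

PLANCK_EV_S = 4.14e-15  # Planck's constant in eV*s (rounded, as used above)
LIGHT_NM_S = 3e17       # speed of light in nm/s (rounded, as used above)

# Made-up wavelengths, assumed to be in nanometres.
df = pd.DataFrame({"Wavelength": [400.0, 620.0, 800.0]})

# E = h*c / wavelength, computed for the whole column at once.
df["Energy"] = (PLANCK_EV_S * LIGHT_NM_S) / df["Wavelength"]

print(df)
```

With these rounded constants the product hc is about 1242 eV·nm, so a 620 nm photon comes out at roughly 2 eV, and longer wavelengths give lower energies.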
4. Plotting data using matplotlib
plt.rc('font', family='serif', serif='Times')
plt.rc('xtick', labelsize=22)
plt.rc('ytick', labelsize=22)
plt.rc('axes', labelsize=22)
plt.rcParams['font.size'] = 22
plt.rcParams['axes.linewidth'] = 2
width = 13
height = width / 1.5
fig, ax = plt.subplots()
fig.subplots_adjust(left=.15, bottom=.16, right=.99, top=.97)
x = df.Energy
y = df.Intensity
y2 = savgol_filter(y, 301, 3)  # window of 301 points, polynomial order 3
plt.plot(x, y2/1000, color='red', lw=4)  # smoothed intensity against energy
ax.set_xlabel('Energy (eV)', labelpad=18, fontsize=22)
ax.set_ylabel('Intensity (a.u.)', labelpad=18, fontsize=22)
ax.set_ylim(-0.1, 65)  # (df.Intensity.max() + 500)/1000)
fig.set_size_inches(width, height)
plt.rc('font', family='serif', serif='Times')
sets the font family and the specific font used for the text in the figure.
plt.rc('xtick', labelsize=22) and plt.rc('ytick', labelsize=22)
set the font size of the tick labels, that is, the numbers on the x and y axes.
plt.rc('axes', labelsize=22)
sets the font size of the axes labels to 22.
plt.rcParams['font.size'] = 22
and plt.rcParams['axes.linewidth'] = 2
set the default font size for text on the plot to 22 and the width of the axes border lines to 2.
savgol_filter was used to smooth the curve, with a window of 301 points and a polynomial of order 3. The smoothed data, divided by 1000, was then plotted against the energy calculated in the dataframe.
ax.set_xlabel('Energy (eV)', labelpad=18, fontsize=22)
and ax.set_ylabel('Intensity (a.u.)', labelpad=18, fontsize=22)
set the labels of the horizontal and vertical axes, and set the padding between each label and its axis to 18.
ax.set_ylim(-0.1, 65)  # (df.Intensity.max() + 500)/1000)
places limits on the y axis. For this data, I wanted a constant range running from -0.1 to 65. If I wanted the upper limit to grow with the maximum value in the data, I could replace 65 with the commented-out expression, giving ax.set_ylim(-0.1, (df.Intensity.max() + 500)/1000).
fig.set_size_inches(width, height)
sets the width and height of the figure in inches, using the width and height variables defined earlier.
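To show what the smoothing step does on its own, here is a minimal sketch that adds random noise to a sine wave and smooths it with savgol_filter. The signal, the noise level and the window of 101 points are assumptions chosen for this example and are not the values used on my spectra. In the default mode the window must not be longer than the data, and the polynomial order must be smaller than the window.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable

# Made-up signal: one period of a sine wave plus Gaussian noise.
x = np.linspace(0, 2 * np.pi, 1000)
clean = np.sin(x)
noisy = clean + rng.normal(scale=0.2, size=x.size)

# Fit a cubic polynomial inside a sliding window of 101 points.
smooth = savgol_filter(noisy, window_length=101, polyorder=3)


def rmse(a, b):
    """Root-mean-square difference between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


print("noisy  vs clean:", rmse(noisy, clean))
print("smooth vs clean:", rmse(smooth, clean))
```

A wider window removes more noise but can also flatten narrow peaks, so it is worth checking the smoothed curve against the raw data for a few files before processing all of them.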
5. Create a new directory to save all the figures in, or save into that directory if it already exists
results_dir = os.path.join(directory, 'Spectra Figures/')
if not os.path.isdir(results_dir):
    os.makedirs(results_dir)
save_name = filename.rsplit(".", 1)[0] + '.jpeg'
plt.savefig(results_dir + save_name, dpi=600, transparent=False, bbox_inches='tight')
results_dir = os.path.join(directory, 'Spectra Figures/')
creates the path to a proposed folder named Spectra Figures inside the data folder.
if not os.path.isdir(results_dir):
checks whether the directory exists. If it does not, os.makedirs(results_dir)
creates it with the name specified in the results_dir path. If it already exists, nothing happens.
save_name = filename.rsplit(".", 1)[0] + '.jpeg'
splits the filename acquired earlier at its last dot into the name part and the extension. Since we want to change the extension, we pick the first part and add .jpeg.
plt.savefig(results_dir + save_name, dpi=600, transparent=False, bbox_inches='tight')
saves the figure in the results folder under the name held in save_name. The dpi is set to 600, but it could be 300 depending on the quality desired. transparent=False keeps the background opaque, and bbox_inches='tight' trims the excess white space around the plot.
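The folder and filename handling can also be tried without any plotting. The sketch below is one alternative way to write the same step, using os.makedirs with exist_ok=True in place of the explicit isdir check and os.path.splitext in place of rsplit. The helper name figure_path and the sample filenames are hypothetical.

```python
import os
import tempfile


def figure_path(directory: str, filename: str, folder: str = "Spectra Figures") -> str:
    """Return the .jpeg path for `filename`, creating the results folder if needed."""
    results_dir = os.path.join(directory, folder)
    os.makedirs(results_dir, exist_ok=True)  # no error if the folder already exists
    stem, _extension = os.path.splitext(filename)  # split at the last dot
    return os.path.join(results_dir, stem + ".jpeg")


with tempfile.TemporaryDirectory() as base:
    first = figure_path(base, "sample_a.txt")
    second = figure_path(base, "run.2.txt")  # second call reuses the existing folder
    folder_exists = os.path.isdir(os.path.join(base, "Spectra Figures"))

print(os.path.basename(first))   # sample_a.jpeg
print(os.path.basename(second))  # run.2.jpeg
```

Only the last extension is replaced, so a name with several dots keeps everything before the final one.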
The loop continues until every .txt file in the folder has been processed.
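Putting the five steps together, the whole function could look something like the sketch below. It is a condensed version and not a copy of my script. The font settings are left out for brevity, the window and dpi parameters and the returned list are additions for this example, the Agg backend is selected on the assumption that the code runs without a display, and plt.close is added so that figures do not pile up in memory when there are hundreds of files.

```python
import os

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
from matplotlib import pyplot as plt
import pandas as pd
from scipy.signal import savgol_filter


def process_directory(directory: str, window: int = 301, dpi: int = 600) -> list:
    """Plot a smoothed spectrum for every .txt file in `directory`.

    `window` and `dpi` are hypothetical parameters added for this sketch;
    the steps above hard-code 301 and 600. Returns the saved figure paths.
    """
    results_dir = os.path.join(directory, "Spectra Figures")
    os.makedirs(results_dir, exist_ok=True)  # create the folder only if missing

    saved = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):
            continue  # skip the results folder and any non-text files

        # Steps 2 and 3: read the file and add the energy column (eV from nm).
        file_path = os.path.join(directory, filename)
        df = pd.read_csv(file_path, delimiter="\t",
                         names=["Wavelength", "Intensity"], header=None)
        df["Energy"] = (4.14e-15 * 3e17) / df["Wavelength"]

        # Step 4: smooth the intensity and plot it against energy.
        fig, ax = plt.subplots(figsize=(13, 13 / 1.5))
        smoothed = savgol_filter(df["Intensity"], window, 3)
        ax.plot(df["Energy"], smoothed / 1000, color="red", lw=4)
        ax.set_xlabel("Energy (eV)", labelpad=18, fontsize=22)
        ax.set_ylabel("Intensity (a.u.)", labelpad=18, fontsize=22)

        # Step 5: save next to the data, then free the figure.
        stem = os.path.splitext(filename)[0]
        save_path = os.path.join(results_dir, stem + ".jpeg")
        fig.savefig(save_path, dpi=dpi, bbox_inches="tight")
        plt.close(fig)
        saved.append(save_path)
    return saved
```

Calling process_directory with the path to the data folder then writes one .jpeg per text file into the Spectra Figures subfolder. Each file needs at least as many rows as the smoothing window.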
Emmanuel Bamidele is a doctoral student in the Materials Science and Engineering Program at the University of Colorado, Boulder, USA. He enjoys working on data science, machine learning, computational materials science and materials design using computational and experimental methods.