Several ways to handle ASCII data

Astronomers love storing tabular data in human-readable ASCII tables. Unfortunately there is very little agreement on a standard way to do this, unlike e.g. FITS.

Reading and Writing files in Pure Python

Reading

We have already talked about Python’s built-in types and operations, but there are more types that we have not yet covered. One of these is the file object, which can be used to read or write files.

Let’s start off by downloading this data file, then launching IPython in the directory where you have the file:

$ ipython --pylab

If you have trouble downloading the file, then start up IPython (ipython --pylab) and enter:

from urllib.request import urlopen  # Python 3 replacement for urllib2
url = 'http://python4esac.github.com/_downloads/data.txt'
open('data.txt', 'wb').write(urlopen(url).read())
ls

Now let’s try and get the contents of the file into IPython. We start off by creating a file object:

In [1]: f = open('data.txt', 'r')

The 'r' means that the file should be opened in read mode (i.e. you will get an error if the file does not exist). Now, simply type:

In [2]: print(f.read())
RAJ        DEJ                          Jmag   e_Jmag
2000 (deg) 2000 (deg) 2MASS             (mag)  (mag) 
---------- ---------- ----------------- ------ ------
010.684737 +41.269035 00424433+4116085   9.453  0.052
010.683469 +41.268585 00424403+4116069   9.321  0.022
010.685657 +41.269550 00424455+4116103  10.773  0.069
010.686026 +41.269226 00424464+4116092   9.299  0.063
010.683465 +41.269676 00424403+4116108  11.507  0.056
010.686015 +41.269630 00424464+4116106   9.399  0.045
010.685270 +41.267124 00424446+4116016  12.070  0.035

The data file has been read in as a single string. Let’s try that again:

In [3]: f.read()
Out[3]: ''

What’s happened? We read the file, and the file ‘pointer’ is now sitting at the end of the file, and there is nothing left to read. Let’s now try and do something more useful, and capture the contents of the file in a string:

In [4]: f = open('data.txt', 'r')  # We need to re-open the file

In [5]: data = f.read()

In [6]: f.close()

Now data contains a string:

In [7]: print(type(data))
<class 'str'>
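The file ‘pointer’ behaviour described above can be sketched as follows. This is a minimal illustration, assuming a small two-line sample file written to a temporary directory (the file name sample.txt is made up for the demonstration). It also uses f.seek(0), a standard file method not covered in the text, to move the pointer back to the start instead of re-opening the file.

```python
import os
import tempfile

# Hypothetical two-line sample file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

f = open(path, 'r')
first_read = f.read()    # reads everything; the pointer is now at the end
second_read = f.read()   # nothing left to read, so this is an empty string
f.seek(0)                # move the pointer back to the start of the file
third_read = f.read()    # the full contents again
f.close()

print(repr(first_read))
print(repr(second_read))
print(third_read == first_read)
```

Re-opening the file, as done in the session above, has the same effect as seeking back to position 0.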

Closing files

Usually, you should close a file when you are done with it to free up system resources (such as file handles and memory). With only a couple of files in an interactive session this hardly matters, but in scripts that deal with dozens of files you should start worrying about it. Often you will see code like this:

with open('data.txt', 'r') as f:
    # do things with your file
    data = f.read()

type(data)

Notice the indent under the with. At the end of that block the file is automatically closed, even if things went wrong and an error occurred inside the with block.

Reading the entire file at once limits us to reading files that will fit into our computer’s memory. What we’d really like to do is read the file line by line. There are several ways to do this, the simplest of which is to use a for loop in the following way:
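To check that a with block really closes the file even when an error happens, here is a small sketch. It assumes a throwaway file in a temporary directory and deliberately raises an error inside the block; the closed attribute of the file object then tells us whether the file was closed.

```python
import os
import tempfile

# Throwaway file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('some data\n')

try:
    with open(path, 'r') as f:
        data = f.read()
        # Simulate something going wrong while the file is still open
        raise ValueError('simulated problem inside the with block')
except ValueError:
    pass

closed_after_error = f.closed   # True: the with block closed the file
print(closed_after_error)
```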

In [8]: f = open('data.txt', 'r')

In [9]: for line in f:
   ...:     print(repr(line))
   ...: 
'RAJ        DEJ                          Jmag   e_Jmag\n'
'2000 (deg) 2000 (deg) 2MASS             (mag)  (mag) \n'
'---------- ---------- ----------------- ------ ------\n'
'010.684737 +41.269035 00424433+4116085   9.453  0.052\n'
'010.683469 +41.268585 00424403+4116069   9.321  0.022\n'
'010.685657 +41.269550 00424455+4116103  10.773  0.069\n'
'010.686026 +41.269226 00424464+4116092   9.299  0.063\n'
'010.683465 +41.269676 00424403+4116108  11.507  0.056\n'
'010.686015 +41.269630 00424464+4116106   9.399  0.045\n'
'010.685270 +41.267124 00424446+4116016  12.070  0.035\n'

Notice the indent before print, which is necessary to indicate that we are inside the loop (there is no end for in Python). Note that we are using repr() to show any invisible characters (this will be useful in a minute).

Each line is being returned as a string. Notice the \n at the end of each line: this is a newline character, which indicates the end of a line.

Note

You may also come across the following way to read files line by line:

for line in f.readlines():
    ...

f.readlines() actually reads in the whole file and splits it into a list of lines, so for large files this can be memory intensive. Using:

for line in f:
    ...

instead is more memory efficient because it only reads one line at a time.
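The difference between the two approaches can be seen in a short sketch. It assumes a three-line sample file in a temporary directory: readlines() returns the whole file as a list, while the for loop only ever holds one line at a time.

```python
import os
import tempfile

# Three-line sample file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('line 1\nline 2\nline 3\n')

# readlines() loads every line into a list in memory
with open(path, 'r') as f:
    all_lines = f.readlines()

# Iterating over the file object handles one line at a time
n_lines = 0
with open(path, 'r') as f:
    for line in f:
        n_lines += 1

print(type(all_lines), len(all_lines), n_lines)
```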

Now that we are reading the file line by line, it would be really nice to get some values out of it. Let’s examine the last line in detail. If we just type line we should see the last line that was printed in the loop:

In [10]: line
Out[10]: '010.685270 +41.267124 00424446+4116016  12.070  0.035\n'

We can first get rid of the \n character with:

In [11]: line = line.strip()

In [12]: line
Out[12]: '010.685270 +41.267124 00424446+4116016  12.070  0.035'

Actually, the strip method removes any whitespace (i.e. spaces, tabs and newline characters) at the beginning and end of the string. If you only want to strip whitespace at the beginning or at the end, use lstrip or rstrip respectively, and if you only want to strip a specific character, use e.g. strip('\n').
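As a quick illustration of these variants, here is a sketch on a made-up string that has spaces at the start and a tab plus a newline at the end:

```python
# Made-up example line with leading spaces and a trailing tab and newline
line = '  010.685270 +41.267124\t\n'

both = line.strip()             # whitespace removed from both ends
left = line.lstrip()            # only the leading spaces removed
right = line.rstrip()           # only the trailing tab and newline removed
newline_only = line.strip('\n') # only newline characters removed from the ends

print(repr(both))
print(repr(left))
print(repr(right))
print(repr(newline_only))
```

Note that strip('\n') leaves the leading spaces and the trailing tab in place, because it only removes the characters you pass to it.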

Next, we can use what we learned about strings and lists to do:

In [13]: columns = line.split()

In [14]: columns
Out[14]: ['010.685270', '+41.267124', '00424446+4116016', '12.070', '0.035']

Finally, let’s say we care about the source name, and the J band magnitude. We can extract these with:

In [15]: name = columns[2]

In [16]: j = columns[3]

In [17]: name
Out[17]: '00424446+4116016'

In [18]: j
Out[18]: '12.070'

Note that j is a string, but if we want a floating point number, we can instead do:

In [19]: j = float(columns[3])

One last piece of information we need about files is how we can read a single line. This is done using:

In [20]: line = f.readline()

We can put all this together to write a little script that reads the data from the file and prints the columns we care about to the screen:

# Open file
f = open('data.txt', 'r')

# Read and ignore header lines
header1 = f.readline()
header2 = f.readline()
header3 = f.readline()

# Loop over lines and extract variables of interest
for line in f:
    line = line.strip()
    columns = line.split()
    name = columns[2]
    j = float(columns[3])
    print(name, j)

f.close()

The output should look like this:

00424433+4116085 9.453
00424403+4116069 9.321
00424455+4116103 10.773
00424464+4116092 9.299
00424403+4116108 11.507
00424464+4116106 9.399
00424446+4116016 12.07
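The same steps can also be sketched with a with block and a small helper function. This is only one possible arrangement: parse_line is a hypothetical name, and the sketch writes a three-source excerpt of data.txt to a temporary file so that it runs on its own.

```python
import os
import tempfile

# A three-source excerpt of data.txt, written to a temporary file
sample = (
    'RAJ        DEJ                          Jmag   e_Jmag\n'
    '2000 (deg) 2000 (deg) 2MASS             (mag)  (mag) \n'
    '---------- ---------- ----------------- ------ ------\n'
    '010.684737 +41.269035 00424433+4116085   9.453  0.052\n'
    '010.683469 +41.268585 00424403+4116069   9.321  0.022\n'
    '010.685657 +41.269550 00424455+4116103  10.773  0.069\n'
)
path = os.path.join(tempfile.mkdtemp(), 'data_excerpt.txt')
with open(path, 'w') as f:
    f.write(sample)


def parse_line(line):
    """Return (name, J magnitude) for one data line (hypothetical helper)."""
    columns = line.strip().split()
    return columns[2], float(columns[3])


sources = []
with open(path, 'r') as f:
    # Read and ignore the three header lines
    for _ in range(3):
        f.readline()
    # Loop over the remaining lines and extract the values of interest
    for line in f:
        sources.append(parse_line(line))

for name, j in sources:
    print(name, j)
```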

Exercise

Try and see if you can understand what the following script is doing:

f = open('data.txt', 'r')
header1 = f.readline()
header2 = f.readline()
header3 = f.readline()
data = []
for line in f:
    line = line.strip()
    columns = line.split()
    source = {}
    source['name'] = columns[2]
    source['j'] = float(columns[3])
    data.append(source)

After this script is run, how would you access the name and J-band magnitude of the third source?

Solution

The following line creates an empty list to contain all the data:

data = []

For each line, we are then creating an empty dictionary and populating it with variables we care about:

source = {}
source['name'] = columns[2]
source['j'] = float(columns[3])

Finally, we append this source to the data list:

data.append(source)

Therefore, data is a list of dictionaries:

>>> data
[{'name': '00424433+4116085', 'j': 9.453},
 {'name': '00424403+4116069', 'j': 9.321},
 {'name': '00424455+4116103', 'j': 10.773},
 {'name': '00424464+4116092', 'j': 9.299},
 {'name': '00424403+4116108', 'j': 11.507},
 {'name': '00424464+4116106', 'j': 9.399},
 {'name': '00424446+4116016', 'j': 12.07}]

You can access the dictionary for the third source with:

>>> data[2]
{'name': '00424455+4116103', 'j': 10.773}

To get the name of this source, you can therefore do:

>>> data[2]['name']
'00424455+4116103'
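The J-band magnitude is accessed in the same way, using the 'j' key. The sketch below rebuilds the list of dictionaries by hand (so that it runs without data.txt) and, as an extra illustration that is not part of the exercise, uses the built-in min function to find the brightest source, i.e. the one with the smallest magnitude.

```python
# The list of dictionaries produced by the script, typed in by hand here
data = [
    {'name': '00424433+4116085', 'j': 9.453},
    {'name': '00424403+4116069', 'j': 9.321},
    {'name': '00424455+4116103', 'j': 10.773},
    {'name': '00424464+4116092', 'j': 9.299},
    {'name': '00424403+4116108', 'j': 11.507},
    {'name': '00424464+4116106', 'j': 9.399},
    {'name': '00424446+4116016', 'j': 12.07},
]

# Name and J-band magnitude of the third source (index 2)
third_name = data[2]['name']
third_j = data[2]['j']
print(third_name, third_j)

# Smaller magnitudes are brighter, so the brightest source has the minimum j
brightest = min(data, key=lambda source: source['j'])
print(brightest['name'], brightest['j'])
```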

Writing

To open a file for writing, use:

In [21]: f = open('data_new.txt', 'w')

Then simply use f.write() to write any content to the file, for example:

In [22]: f.write("Hello, World!\n")

If you want to write multiple lines, you can either give a list of strings to the writelines() method:

In [23]: f.writelines(['spam\n', 'egg\n', 'spam\n'])

or you can write them as a single string:

In [24]: f.write('spam\negg\nspam')

To close a file, simply use:

In [25]: f.close()

(this also applies to reading files)
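Here is a short sketch that combines these writing calls with a with block, so that the file is closed automatically, and then reads the result back to check it. It assumes a temporary directory rather than your working directory, and it additionally uses the standard 'a' (append) mode, which is not covered above, to add a line without overwriting the file.

```python
import os
import tempfile

# Write into a temporary directory so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'data_new.txt')

# 'w' creates the file, or empties it if it already exists
with open(path, 'w') as f:
    f.write('Hello, World!\n')
    # writelines() does not add newlines, so each string carries its own
    f.writelines(['spam\n', 'egg\n', 'spam\n'])

# 'a' appends to the end of the existing file instead of replacing it
with open(path, 'a') as f:
    f.write('one more line\n')

with open(path, 'r') as f:
    content = f.read()

print(content)
```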

Exercise

Let’s try combining reading and writing. Using at most seven lines, write a script which will read in data.txt, replace any spaces with periods (.), and write the result out to a file called data_new.txt.

Can you do it in a single line? (you can ignore closing the file)

Solution

Here is a possible solution:

f1 = open('data.txt', 'r')
content = f1.read()
f1.close()

content = content.replace(' ','.')

f2 = open('data_new.txt', 'w')
f2.write(content)
f2.close()

And a possible one-liner:

open('data_new.txt', 'w').write(open('data.txt', 'r').read().replace(' ', '.'))

Even though one-liners like this are possible in Python, they are generally considered poor style because they are much less readable than the longer version above.
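A middle ground between the two solutions is to open both files in a single with statement, which keeps the code short while still closing both files. The sketch below assumes a small stand-in input file in a temporary directory instead of the real data.txt.

```python
import os
import tempfile

# Stand-in input file, created only for this demonstration
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'data.txt')
dst = os.path.join(workdir, 'data_new.txt')
with open(src, 'w') as f:
    f.write('a b  c\n')

# Both files are closed automatically at the end of the with block
with open(src, 'r') as f_in, open(dst, 'w') as f_out:
    f_out.write(f_in.read().replace(' ', '.'))

with open(dst, 'r') as f:
    result = f.read()
print(repr(result))
```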

NumPy

NumPy provides two functions to read in ASCII data. np.loadtxt is meant for relatively simple tables without missing values:

from io import StringIO   # Wraps a string in a file-like object, because
                          # loadtxt expects a filename or an open file
import numpy as np        # already done for you by ipython --pylab
c = StringIO("0 1\n2 3")
np.loadtxt(c)
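To see what np.loadtxt gives back, here is a small sketch on a made-up two-column table. It assumes Python 3 (hence io.StringIO) and uses two standard loadtxt behaviours: lines starting with # are treated as comments by default, and unpack=True returns one array per column.

```python
from io import StringIO

import numpy as np

# Made-up table; the line starting with '#' is skipped as a comment
table = StringIO("# x y\n0 1\n2 3\n")
values = np.loadtxt(table)
print(values.shape)   # two rows, two columns
print(values)

# unpack=True transposes the result, giving one array per column
x, y = np.loadtxt(StringIO("0 1\n2 3\n"), unpack=True)
print(x, y)
```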

Here is a more complicated example, that is actually useful:

d = StringIO('''
# Abundances of different elements
# for TW Hya
# taken from Guenther et al. (2007)
# element, abund, error, first-ionisation-potential [eV]
C  0.2  0.03 11.3
N  0.51 0.05 14.6
O  0.25 0.01 13.6
Ne 2.46 0.08 21.6
Fe 0.19 0.01  7.9
''')
data = np.loadtxt(d, dtype={'names': ('elem', 'abund', 'error', \
    'FIP'),'formats': ('S2', 'f4', 'f4', 'f4')})

plt.errorbar(data['FIP'], data['abund'], yerr = data['error'], fmt = 'o')

The resulting plot clearly shows the inverse first ionization potential (FIP) effect. That means that elements with a large FIP are enhanced in the corona.

The second function, np.genfromtxt, is more versatile. It can fill missing values in a table, read column names, exclude some columns, and guess the data type of each column when called with dtype=None. Here is an example:

d = StringIO('''
#element abund error FIP
C  0.2  0.03 11.3
N  0.51 0.05 14.6
O  0.25 0.01 13.6
Ne 2.46 0.08 21.6
S  nn   nn   10.4
Fe 0.19 0.01  7.9
other elements were not measured
''')
data = np.genfromtxt(d, dtype=None, names = True, \
     skip_footer = 1, missing_values = ('nn'), filling_values=(np.nan))

Examine what was returned:

data

This is an instance of the NumPy structured array type, which is an efficient way to manipulate records of tabular data. It stores columns of typed data and you can access either a column of data or a row of data at once:

data.dtype
data[1]
data['abund']
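
Row and column access on a structured array can be sketched as follows. This is a shortened version of the table above, assuming Python 3 and a NumPy version that accepts the encoding argument of np.genfromtxt (passed here so that the element names come back as ordinary strings). The last step, selecting only the measured rows with np.isnan, is an extra illustration.

```python
from io import StringIO

import numpy as np

# Shortened version of the abundance table used above
table = StringIO(
    "#element abund error FIP\n"
    "C  0.2  0.03 11.3\n"
    "S  nn   nn   10.4\n"
    "Fe 0.19 0.01  7.9\n"
)
data = np.genfromtxt(table, dtype=None, names=True,
                     missing_values='nn', filling_values=np.nan,
                     encoding='utf-8')

print(data.dtype.names)    # column names read from the first line
second_row = data[1]       # one row: the record for S
abund = data['abund']      # one column: all abundances
print(second_row)
print(abund)

# Keep only the rows where the abundance was actually measured
measured = data[~np.isnan(data['abund'])]
print(measured['element'])
```
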
Copyright: Smithsonian Astrophysical Observatory under terms of CC Attribution 3.0 Creative Commons License