
SciServerLab_session2_2023-24_StudentCopy

November 21, 2023

1 Galaxies and the large-scale structure of the Universe


A Python exercise notebook written by Rita Tojeiro, October 2017; revised in October 2023. This
notebook has benefited from examples provided by Britt Lundgren (University of North Carolina)
and Jordan Raddick (Johns Hopkins University).
In this Notebook, you will use data from the Sloan Digital Sky Survey (SDSS) to explore the
relationship between galaxy properties and the large-scale structure of the Universe.
In the end, you should have found an answer to the following questions:
• How are galaxies spatially distributed in the Universe?
• Are galaxies all the same colour?
• Are galaxies all the same shape?
• How are galaxies’ colours and shapes related to their spatial distribution?

1.1 SDSS and SciServer


You will answer the above questions yourself, by exploring the largest astronomical dataset in the
world - the Sloan Digital Sky Survey (www.sdss.org).
You will interact directly with the data using SciServer. SciServer is a cloud-based computing
service, that allows users to query the SDSS database, store data in the cloud, and offers seamless
integration with python programming tools via notebooks. In practice, this means that you can
access and manipulate the largest astronomical dataset in the world, using only a web browser.
If you’re reading this, you have already followed the instructions to get an account on SciServer,
and have uploaded this notebook. This exercise also assumes that you have completed the
previous Notebook: “Introduction to Python and SDSS spectroscopic data”.

1.2 This Lab


This notebook guides you through your exploration of the SDSS data. You will be given a rough
outline and some examples. As before, empty code cells will be provided for you to complete the
exercises. Always make sure you execute the code cells above an exercise before attempting to
solve it.
At the end of your session, you will need to print your notebook to a PDF for Moodle submission.
This will automatically print all of your code, plots and other outputs. To print, choose
File->Print Preview. That will open another tab in your browser, with a print-friendly version of your
notebook. Make sure you have answered all exercises.

1.3 Assessment
As a reminder, assessment for the Astro Lab is via a Moodle quiz. The quiz will open and close on
the day of your second Lab session. You will need to have your two notebooks accessible
to answer the quiz. The quiz itself shouldn’t take more than 10 minutes - you will have finished
the lab by the time you sit the quiz, so you will have already done all of the hard work!

1.3.1 Imports
Firstly, we will import the necessary SciServer and support libraries.
[1]: # Import Python libraries to work with SciServer
import SciServer.CasJobs as CasJobs # query with CasJobs
import SciServer.SciDrive as SciDrive # read/write to/from SciDrive
import SciServer.SkyServer as SkyServer # show individual objects and generate thumbnail images through SkyServer

print('SciServer libraries imported')

# Import other libraries for use in this notebook.
import numpy as np # standard Python lib for math ops
#from scipy.misc import imsave # save images as files
import pandas # data manipulation package
import matplotlib.pyplot as plt # graphing package
import os # manage local files in your Compute containers

print('Supporting libraries imported')

# Apply some special settings to the imported libraries

# ensure columns get written completely in notebook
# (None means "no limit"; the older value -1 is deprecated from pandas 1.0)
pandas.set_option('display.max_colwidth', None)
# do *not* show python warnings
import warnings
warnings.filterwarnings('ignore')
print('Settings applied')

SciServer libraries imported
Supporting libraries imported
Settings applied

1.4 Querying the SDSS database


The SDSS data is stored in a SQL database. SQL is a language used to communicate with
databases via “queries”. For each query command, the database returns an answer. Usually, this is
a subsample of the original database, though SQL can operate on the data very effectively too. In
this tutorial we will submit queries to the SDSS database to gather the information that we need,
and we will use python to operate on, manipulate, and visualise that data.
An extensive tutorial on how to query the SDSS database is provided here:
http://skyserver.sdss.org/dr14/en/help/howto/search/searchhowtohome.aspx . In short,
every SQL command consists of three blocks:
• The SELECT block: it defines the quantities that you want your query to return.
• The FROM block: it defines which tables of the database you want SQL to look in.
• The WHERE block: it defines any constraints on the data that you want to impose.
In this Lab you won’t have to write SQL queries from scratch, only execute commands that are
already written for you.
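To make the three-block structure concrete, here is a minimal sketch that assembles a query string from its SELECT, FROM and WHERE parts. The helper `build_query` is hypothetical (it is not part of SciServer or SDSS); the column and table names are the ones used later in this notebook.

```python
# Hypothetical helper for illustration only: not part of SciServer.
def build_query(select_block, from_block, where_block):
    """Join the SELECT, FROM and WHERE blocks into a single SQL string."""
    return "\n".join([
        "SELECT " + select_block,   # quantities to return
        "FROM " + from_block,       # table(s) to look in
        "WHERE " + where_block,     # constraints on the rows
    ])

example_query = build_query(
    select_block="p.objId, p.ra, p.dec",
    from_block="galaxy AS p",
    where_block="p.ra BETWEEN 100 AND 250",
)
print(example_query)
```

The resulting string has the same shape as the query used in the next section, which simply lists more columns and more constraints.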

1.4.1 Using SQL and SciServer to return galaxy data


For the database schema and documentation see http://skyserver.sdss.org/dr14/en/help/browser/browser.aspx
The following query returns specific information on a sample of galaxies, as a dataframe. Execute
the next cell.
[2]: # Find objects in the Sloan Digital Sky Survey (SDSS).
#
# The query below is run against the "dr16" context (see the executeQuery call).
# For the database schema and documentation see http://skyserver.sdss.org/dr14
#
# This query finds all galaxies with a size (petror90_r) greater than 10 arcseconds,
# within a region of sky with 100 < RA < 250, a redshift between 0.02 and 0.1,
# and a g-band magnitude brighter than 17.
#
# First, store the query in an object called "query"
query="""
SELECT p.objId,p.ra,p.dec,p.petror90_r, p.expAB_r,
p.dered_u as u, p.dered_g as g, p.dered_r as r, p.dered_i as i,
s.z, s.plate, s.mjd, s.fiberid
FROM galaxy AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE p.petror90_r > 10
and p.ra between 100 and 250
and s.z between 0.02 and 0.1
and p.g < 17
"""
# Then, query the database. The answer is a table that is returned as a
# dataframe that we've named all_gals.
all_gals = CasJobs.executeQuery(query, "dr16")

print("SQL query finished.")

SQL query finished.

The dataframe that is returned, which we named all_gals, holds the following quantities (in separate
columns) for each galaxy:
• ra = Right Ascension coordinate in degrees
• dec = Declination coordinate in degrees
• petror90_r = Radius enclosing 90% of the Petrosian flux in arcseconds, i.e., the size of the
galaxy on the sky.
• u, g, r, i (renamed from dered_u, dered_g, dered_r, dered_i in the query) = Magnitudes in 4
optical filters, from the blue to the red, after subtracting the attenuation due to the Milky Way.
• z = Redshift of the galaxy
• plate = Plate number (SDSS used aluminium plates with drilled holes for positioning optical
fibers).
• mjd = Date of the observation (Modified Julian Date)
• fiberid = Number of the fiber in a given plate. Plates have between 640 and 1000 fibers.
Each row on the dataframe corresponds to one galaxy.
You can inspect the first rows of your dataframe with the cell below. Note that .loc slices by label
and includes both end points, so loc[0:10] returns 11 rows (the galaxies labelled 0 to 10):

[3]: all_gals.loc[0:10]

[3]: objId ra dec petror90_r expAB_r u \


0 1237648721254481977 234.238034 -0.087819 11.57409 0.704878 17.49338
1 1237648721254875626 235.169008 -0.159319 18.38309 0.518115 17.23740
2 1237648721255202826 235.836707 -0.193186 11.26388 0.859128 18.60228
3 1237648721255792721 237.202220 -0.000421 38.75308 0.467556 17.46235
4 1237648721746657464 132.190120 0.291519 15.24428 0.739572 17.38089
5 1237648721746788826 132.506694 0.225775 12.61026 0.938516 17.82496
6 1237648721746854319 132.632067 0.319063 13.46899 0.858372 17.16061
7 1237648721746919458 132.711262 0.373291 17.92269 0.831062 17.03832
8 1237648721746919659 132.790813 0.253595 12.61691 0.837935 16.79841
9 1237648721747247252 133.494521 0.350043 12.93189 0.334808 17.38333
10 1237648721749147894 137.786983 0.311405 12.83611 0.218255 17.97545

g r i z plate mjd fiberid


0 15.75133 15.03095 14.66924 0.039036 315 51663 279
1 15.46671 14.65872 14.23416 0.078301 315 51663 180
2 16.67542 15.78932 15.39479 0.079167 315 51663 30
3 15.51686 14.55512 14.13725 0.095838 342 51691 157
4 15.53753 14.75403 14.38051 0.052994 467 51901 147
5 16.04960 15.26313 14.89644 0.051329 467 51901 107
6 15.44577 14.64965 14.25556 0.053292 468 51912 267
7 15.16663 14.31446 13.92563 0.052091 467 51901 61
8 14.99341 14.18889 13.81894 0.040880 468 51912 222
9 15.95697 15.23548 14.83383 0.028156 468 51912 172
10 16.45066 15.68234 15.23371 0.070201 472 51955 384

2 And, for example, access certain galaxies individually:
[4]: all_gals.loc[30]

[4]: objId 1.237649e+18


ra 1.502984e+02
dec 2.256899e-01
petror90_r 2.854380e+01
expAB_r 9.732183e-01
u 1.618858e+01
g 1.433833e+01
r 1.353901e+01
i 1.313791e+01
z 3.264382e-02
plate 2.680000e+02
mjd 5.163300e+04
fiberid 5.910000e+02
Name: 30, dtype: float64

And specific properties of a given galaxy like so:


[4]: all_gals.loc[30]['dec']

[4]: 0.225689876893376
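The indexing patterns used above can be tried on a small stand-in table. This is a sketch with made-up values (the dataframe `toy` is not SDSS data); it contrasts label-based `.loc` slicing, which includes the end label, with position-based `.iloc` slicing, which excludes the end position.

```python
import pandas

# Toy table standing in for all_gals; the values are made up for illustration.
toy = pandas.DataFrame({
    "ra": [10.0, 20.0, 30.0, 40.0],
    "dec": [-1.0, 0.0, 1.0, 2.0],
})

rows_by_label = toy.loc[0:2]       # label-based: includes the row labelled 2
rows_by_position = toy.iloc[0:2]   # position-based: stops before position 2
one_value = toy.loc[2, "dec"]      # single value: row label 2, column 'dec'
n_rows = len(toy)                  # number of rows in the table

print(len(rows_by_label), len(rows_by_position), one_value, n_rows)
```

`toy.loc[2, "dec"]` returns the same value as `toy.loc[2]["dec"]`, the form used in the cell above, but does the lookup in a single step.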

2.0.1 Exercise 1:
How many galaxies does your dataframe hold?
[5]: len(all_gals)

[5]: 37951

2.1 The large scale structure of the Universe


2.1.1 Exercise 2:
1. Plot the positions of all galaxies using plt.scatter(). Remember to add labels and a title to
your plot. Given the large number of points, you might want to use marker='.' and s=1.
2. What can you tell from the distribution of galaxies? Are they uniformly distributed on the
sky?
[8]: plt.figure(figsize=(10,8))
plt.scatter(all_gals['ra'],all_gals['dec'], marker='.', s=1, color='blue')
plt.xlabel('Right ascension (degrees)', fontsize=20)
plt.ylabel('Declination (degrees)', fontsize=20)
plt.title('Position of galaxies')

[8]: Text(0.5, 1.0, 'Position of galaxies')

2. Some areas of the plot are more densely populated with galaxies than others, so the galaxies
are not uniformly distributed on the sky.

2.1.2 Exercise 3:
1) Using the np.where() command, select galaxies in two narrow redshift slices:
• slice 1: 0.02 < z < 0.03 (green)
• slice 2: 0.03 < z < 0.04 (orange)
2) Make the same plot as above, but only using the galaxies in each slice using the suggested
colour scheme (make one plot for each slice).
3) Make a third plot with galaxies from both redshift slices.
Remember to add axis labels, a title and a legend to each plot.
[9]: slice_1 = np.where((0.02<all_gals['z'])&(all_gals['z']<0.03))[0]

[10]: slice_2 = np.where((0.03<all_gals['z'])&(all_gals['z']<0.04))[0]
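The selection pattern used in these cells can be checked on a handful of made-up redshifts. In this sketch each comparison produces a boolean array, `&` combines them element by element, and `np.where(...)[0]` returns the positions of the `True` entries. The parentheses around each comparison are needed because `&` binds more tightly than `<` and `>`.

```python
import numpy as np

# Made-up redshifts for illustration (not SDSS values)
z = np.array([0.015, 0.021, 0.029, 0.033, 0.045])

mask = (z > 0.02) & (z < 0.03)   # True where both conditions hold
indices = np.where(mask)[0]      # positions of the True entries

print(mask)
print(indices)
```

Because all_gals has the default row labels 0, 1, 2, ..., these positions can be passed straight to `all_gals.loc[...]`, as in the plotting cell that follows.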

[11]: # Slice 1 only
plt.figure(figsize=(10,8))
plt.scatter(all_gals.loc[slice_1]['ra'], all_gals.loc[slice_1]['dec'], marker='.', s=1, color='green', label='0.02 < z < 0.03')
plt.xlabel('Right ascension (degrees)', fontsize=20)
plt.ylabel('Declination (degrees)', fontsize=20)
plt.title('Position of galaxies with 0.02 < z < 0.03')
plt.legend()

# Slice 2 only
plt.figure(figsize=(10,8))
plt.scatter(all_gals.loc[slice_2]['ra'], all_gals.loc[slice_2]['dec'], marker='.', s=1, color='orange', label='0.03 < z < 0.04')
plt.xlabel('Right ascension (degrees)', fontsize=20)
plt.ylabel('Declination (degrees)', fontsize=20)
plt.title('Position of galaxies with 0.03 < z < 0.04')
plt.legend()

# Both slices on the same axes
plt.figure(figsize=(10,8))
plt.scatter(all_gals.loc[slice_1]['ra'], all_gals.loc[slice_1]['dec'], marker='.', s=1, color='green', label='0.02 < z < 0.03')
plt.scatter(all_gals.loc[slice_2]['ra'], all_gals.loc[slice_2]['dec'], marker='.', s=1, color='orange', label='0.03 < z < 0.04')
plt.xlabel('Right ascension (degrees)', fontsize=20)
plt.ylabel('Declination (degrees)', fontsize=20)
plt.title('Position of galaxies in both redshift slices')
plt.legend()

[11]: <matplotlib.legend.Legend at 0x7fb17e6f7cd0>

2.1.3 Exercise 4:
Do you see more structure in the distribution of galaxies in each slice, when compared to your first
plot that included all galaxies?
What can you tell about the structure you see in the two different redshift slices?
Why was it harder to see in your first plot, where you included all galaxies?
Slice 1 is most densely populated near the middle of the plot, at a lower RA than the most
densely populated area of slice 2. The galaxies in slice 2 are, on average, further away, which
is consistent with the higher redshift of slice 2.
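To put rough numbers on "further away", the sketch below converts the slice boundaries to approximate distances using the low-redshift relation d ≈ cz/H0. This relation and the value H0 = 70 km/s/Mpc are assumptions made for illustration; neither is given in this notebook, and the approximation only holds for small redshifts such as these.

```python
C_KM_S = 299792.458   # speed of light in km/s
H0 = 70.0             # ASSUMED Hubble constant in km/s/Mpc (not from the notebook)

def approx_distance_mpc(z):
    """Low-redshift approximation d = c z / H0, in Mpc."""
    return C_KM_S * z / H0

# Boundaries of the two redshift slices used in Exercise 3
for z in (0.02, 0.03, 0.04):
    print(z, round(approx_distance_mpc(z)), "Mpc")
```

Under these assumptions each slice spans a few tens of Mpc along the line of sight, whereas the full sample (0.02 < z < 0.1) stacks a much deeper volume onto the same patch of sky.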

2.2 Galaxy colours


You will see in lectures that the optical colours of galaxies are related to the age of their stars -
red galaxies hold older stars, whereas blue galaxies tend to have younger stars. In practice, we can
quantify “colour” in Astronomy as the difference in magnitude in two different bands.
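As a worked example of this definition, the sketch below computes the u-g colour of the first galaxy in the table shown earlier (row 0: u = 17.49338, g = 15.75133). The conversion to a flux ratio uses the standard definition of magnitudes, m1 - m2 = -2.5 log10(F1/F2), which is not stated in this notebook.

```python
# Magnitudes of the galaxy in row 0 of the table shown earlier
u, g = 17.49338, 15.75133

colour_ug = u - g   # larger value = redder galaxy

# Standard magnitude definition: u - g = -2.5 * log10(F_u / F_g),
# so the g-band to u-band flux ratio is 10**(0.4 * (u - g)).
flux_ratio_g_to_u = 10 ** (0.4 * colour_ug)

print(round(colour_ug, 5))
print(round(flux_ratio_g_to_u, 2))
```

A colour of about 1.74 therefore means this galaxy is roughly five times brighter in the g filter than in the u filter.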
The final exercises in the first notebook (SciServerLab_session1) give you a demonstration of how
colours work in practice. We didn’t consider it in the first notebook, but redshift can also affect
the observed colour of galaxies (you will learn this in the later lectures, if you haven’t yet).
In this set of exercises we will focus on the first slice in redshift, which is very narrow, meaning
that all galaxies have a similar redshift. I.e., if galaxies in this redshift slice have different colours,
it ought to be because their spectra and stellar composition are different, and not because some are
redshifted due to the expansion of the Universe.
The following cell plots a histogram of the values of the u-g colour of the galaxies in your dataframe:
[12]: slice1 = np.where( (all_gals['z'] > 0.02) & (all_gals['z'] < 0.03))[0]

plt.hist(all_gals.loc[slice1]['u']-all_gals.loc[slice1]['g'], bins=40, range=(0.5,2.5))
plt.xlabel('u-g')
plt.ylabel('Number of galaxies')
plt.title('Distribution of u-g color in 0.02 < z < 0.03')

[12]: Text(0.5, 1.0, 'Distribution of u-g color in 0.02 < z < 0.03')

np.percentile() (https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html)
allows you to quickly return the percentile of a distribution of points. For example, to find the
median (50th percentile) u-g colour of your galaxy population you can write:

[13]: median_umg = np.percentile(all_gals.loc[slice1]['u']-all_gals.loc[slice1]['g'], 50) # a value of u-g
print(median_umg)

1.4351650000000005
i.e., 50% of the galaxies in your sample have u-g colours that are lower than 1.435 (i.e., they are
bluer than the median), and 50% have u-g colours that are larger (i.e., they are redder than the
median). If I wanted to choose only the 10% reddest galaxies I could do:

[14]: high_umg = np.percentile(all_gals.loc[slice1]['u']-all_gals.loc[slice1]['g'], 90)
very_red_galaxies = np.where((all_gals['z'] > 0.02) & (all_gals['z'] < 0.03) & (all_gals['u']-all_gals['g'] > high_umg))[0]
high_umg

[14]: 1.8633340000000014
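The same percentile-then-select pattern can be verified on a small synthetic sample where the answer is known in advance. The 100 evenly spaced "colours" below are made up for illustration; selecting above the 90th percentile and below the 10th percentile should each return 10 of them.

```python
import numpy as np

# 100 made-up u-g values, evenly spaced: 0.02, 0.04, ..., 2.00
colours = np.arange(1, 101) / 50.0

high = np.percentile(colours, 90)   # 90% of the values lie below this
low = np.percentile(colours, 10)    # 10% of the values lie below this

reddest = np.where(colours > high)[0]   # indices of the reddest 10%
bluest = np.where(colours < low)[0]     # indices of the bluest 10%

print(len(reddest), len(bluest))
```

Note that the red selection uses the high threshold and the blue selection uses the low one; using the same threshold for both would split the sample into two unequal halves rather than picking out its two tails.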

2.2.1 Exercise 5:
Following the example above, use np.percentile() to choose the 25% reddest and 25% bluest
galaxies in u-g. Then plot their positions on the sky. Do both types of galaxies trace the large-
scale structure in a similar way? What can you say about which galaxies preferentially sit in
denser parts of the Universe, and which sit in less dense regions (we call this environment)? For
this exercise it is recommended that you make two plots (one for the red galaxies, and one for the
blue), so that it is easier to compare. You may use as many cells as needed.

[15]: # 75th percentile of u-g: galaxies above it are the 25% reddest
high_umg = np.percentile(all_gals.loc[slice1]['u']-all_gals.loc[slice1]['g'], 75)
very_red_galaxies = np.where((all_gals['z'] > 0.02) & (all_gals['z'] < 0.03) & (all_gals['u']-all_gals['g'] > high_umg))[0]

# 25th percentile of u-g: galaxies below it are the 25% bluest
low_umg = np.percentile(all_gals.loc[slice1]['u']-all_gals.loc[slice1]['g'], 25)
very_blue_galaxies = np.where((all_gals['z'] > 0.02) & (all_gals['z'] < 0.03) & (all_gals['u']-all_gals['g'] < low_umg))[0]

[16]: # 25% reddest galaxies in the slice
plt.figure(figsize=(10,8))
plt.scatter(all_gals.loc[very_red_galaxies]['ra'], all_gals.loc[very_red_galaxies]['dec'], marker='.', s=1, color='red', label='25% reddest in u-g')
plt.xlabel('Right ascension (degrees)', fontsize=20)
plt.ylabel('Declination (degrees)', fontsize=20)
plt.legend()
plt.title('Position of the 25% reddest galaxies')

# 25% bluest galaxies in the slice
plt.figure(figsize=(10,8))
plt.scatter(all_gals.loc[very_blue_galaxies]['ra'], all_gals.loc[very_blue_galaxies]['dec'], marker='.', s=1, color='blue', label='25% bluest in u-g')
plt.xlabel('Right ascension (degrees)', fontsize=20)
plt.ylabel('Declination (degrees)', fontsize=20)
plt.legend()
plt.title('Position of the 25% bluest galaxies')

[16]: Text(0.5, 1.0, 'Position of the 25% bluest galaxies')

By now you will have started developing an understanding of how galaxies in general are spatially
distributed in the Universe and the shape of the cosmic web, and how galaxies’ position on the
cosmic web and their environment is related to their colour. Next, we will look at the shape of
galaxies.

2.3 Galaxy morphology


Galaxy morphology studies the shapes of galaxies. You will already have some understanding of
what local galaxies look like, from your exploration of SDSS imaging in the first Lab session using
the SDSS SkyServer Navigate Tool.
Here, we will do a more systematic exploration of how galaxy shapes are related to other properties.
The next cell provides a bit of code that selects 16 random galaxies from a dataframe, and shows
you their optical images. Execute it.
[17]: def show_galaxy_images(my_galaxies):
    # plot a random subset of 16 galaxies
    # set thumbnail parameters
    width=200 # image width
    height=200 # height
    pixelsize=0.396 # image scale
    plt.figure(figsize=(15, 15)) # display in a 4x4 grid
    subPlotNum = 1

    nGalaxies = 16 # total number of galaxies to plot
    # randomly selected positions in my_galaxies (drawn with replacement, so repeats are possible)
    ind = np.random.randint(0,len(my_galaxies), nGalaxies)
    count=0
    for i in ind: # iterate through the randomly selected rows in the DataFrame
        count=count+1
        print('Getting image '+str(count)+' of '+str(nGalaxies)+'...')
        if (count == nGalaxies):
            print('Plotting images...')
        galaxy = all_gals.loc[my_galaxies[i]] # row of the selected galaxy
        # set the cutout scale from the size of this same galaxy
        scale=2*galaxy['petror90_r']/pixelsize/width
        img = SkyServer.getJpegImgCutout(ra=galaxy['ra'], dec=galaxy['dec'], width=width, height=height, scale=scale, dataRelease='DR14')
        plt.subplot(4,4,subPlotNum)
        subPlotNum += 1
        plt.imshow(img) # show images in grid
        plt.title(galaxy['z'])
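The function above draws its random positions with `np.random.randint`, which samples with replacement, so the same galaxy can occasionally appear twice in one grid. If 16 distinct galaxies are wanted, one alternative is NumPy's `Generator.choice` with `replace=False`. The sketch below is a suggestion, not part of the original function; the array `my_galaxies` here is a made-up stand-in for an array of row indices.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # seeded only so this sketch is repeatable
my_galaxies = np.arange(100, 150)     # stand-in for an array of row indices
n_galaxies = 16

# Positions into my_galaxies, each picked at most once
positions = rng.choice(len(my_galaxies), size=n_galaxies, replace=False)
chosen_rows = my_galaxies[positions]

print(len(set(chosen_rows.tolist())))   # number of distinct galaxies chosen
```

This requires the selection to contain at least 16 galaxies; with fewer, `replace=False` raises an error rather than repeating entries.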

You can use the function defined above to plot 16 random galaxies from any dataframe. For
example, to plot 16 galaxies randomly selected in a redshift slice 0.02 < z < 0.03 you might do:
[20]: my_galaxies = np.where( (all_gals['z'] > 0.02) & (all_gals['z'] < 0.03))[0]
print(my_galaxies)
show_galaxy_images(my_galaxies)

[ 9 19 21 … 37923 37924 37930]


Getting image 1 of 16…
Getting image 2 of 16…
Getting image 3 of 16…
Getting image 4 of 16…
Getting image 5 of 16…
Getting image 6 of 16…
Getting image 7 of 16…
Getting image 8 of 16…
Getting image 9 of 16…
Getting image 10 of 16…
Getting image 11 of 16…
Getting image 12 of 16…
Getting image 13 of 16…

Getting image 14 of 16…
Getting image 15 of 16…
Getting image 16 of 16…
Plotting images…

2.3.1 Exercise 6:
Compute the fraction of galaxies you’d classify as having disks, and the fraction of galaxies you’d
classify as being smooth ellipsoids. If you want to improve your statistics, you can rerun the cell
above and you will get 16 different galaxies every time…
[ ]: I would say galaxies 2, 3, 4, 5, 7, 8, 11, 13, 15 and 16 have disks (10/16).
Galaxies 1, 6, 9, 10 and 14 are smooth ellipsoids (5/16), and galaxy 12 is inconclusive because of its size (1/16).

Answer here (double-click to edit):

2.3.2 Exercise 7:
Now starting from the code given in the example above (copy it and paste it onto the cell below),
do the same thing but taking 16 random galaxies that are red, according to your earlier definition
of red and blue. Once again, classify the galaxies as disks or ellipticals. Note, after copying and
pasting, you only need to change the first line, that defines my_galaxies.
[21]: my_galaxies = np.where((all_gals['z'] > 0.02) & (all_gals['z'] < 0.03) & (all_gals['u']-all_gals['g'] > high_umg))[0]

print(my_galaxies)
show_galaxy_images(my_galaxies)

[ 19 66 117 … 37758 37793 37880]


Getting image 1 of 16…
Getting image 2 of 16…
Getting image 3 of 16…
Getting image 4 of 16…
Getting image 5 of 16…
Getting image 6 of 16…
Getting image 7 of 16…
Getting image 8 of 16…
Getting image 9 of 16…
Getting image 10 of 16…
Getting image 11 of 16…
Getting image 12 of 16…
Getting image 13 of 16…
Getting image 14 of 16…
Getting image 15 of 16…
Getting image 16 of 16…
Plotting images…

[ ]: disk : 5/16
elliptical: 11/16

2.3.3 Exercise 8:
Repeat the above exercise, now with blue galaxies. Repeat your classification exercise.
[23]: # blue galaxies: u-g below the 25th-percentile threshold (low_umg)
my_galaxies = np.where((all_gals['z'] > 0.02) & (all_gals['z'] < 0.03) & (all_gals['u']-all_gals['g'] < low_umg))[0]

print(my_galaxies)
show_galaxy_images(my_galaxies)

[ 9 21 32 … 37923 37924 37930]


Getting image 1 of 16…

Getting image 2 of 16…
Getting image 3 of 16…
Getting image 4 of 16…
Getting image 5 of 16…
Getting image 6 of 16…
Getting image 7 of 16…
Getting image 8 of 16…
Getting image 9 of 16…
Getting image 10 of 16…
Getting image 11 of 16…
Getting image 12 of 16…
Getting image 13 of 16…
Getting image 14 of 16…
Getting image 15 of 16…
Getting image 16 of 16…
Plotting images…

[ ]: elliptical : 3/16
spiral : 13/16

2.3.4 Exercise 9:
From the above exercise, what can you say - if anything - about the relationship between colour
and morphology?
Answer here (double click to edit):
Congratulations, that is the end of the Lab! Make sure you’ve run all the code cells,
filled in all the text answers and that your plots are all showing without error. Print
to PDF, and submit to Moodle by the deadline. This account on SciServer is yours to keep,
and you’re welcome to explore further at any time. If you do, and you ever need some guidance, I
would be more than happy to help.
[ ]: Bluer galaxies are more likely to contain disks and be spiral-like.
Redder galaxies are less likely to contain disks and are more often classified as ellipsoids.
