Mining Software Repositories

What to do? And where to get data?

Israel Herraiz <herraiz@uax.es>


Universidad Alfonso X el Sabio
June 18th, 2010

http://www.uax.es

Outline
1. What is Mining Software Repositories? What are repositories?

2. Conferences and journals of interest
   And some words about trending topics

3. Tools for Mining Software Repositories

4. Datasets for Mining Software Repositories
   For replicable and verifiable empirical studies

1. What is Mining Software Repositories?

MSR analyzes the rich data available in software repositories to uncover interesting and actionable information about software systems and projects.

Popular topic since 2004
   MSR workshop, co-located with ICSE
   Working Conference since 2008

What are repositories?

Anything that leaves a trail about any software development or maintenance activity

Also includes any software artifact

Typically:
   Version control systems
   Bug tracking systems
   Public communication tools (mailing lists)

Differences between artifact and repository

hello.c

    #include <stdio.h>
    int main() {
        printf("Hello world");
        return 0;
    }

Artifact
   Source code file

hello.c.diff

    - printf("Hello world");
    + printf("Hello world\n");

    Author: rms
    Date: 20100618 04:34 UTC
    Change: +1 -1
    Log: Forgot to add new line

Repository
   Change to an artifact
   Meta-information
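
The difference can also be sketched in code: an artifact is just file content, while a repository entry pairs a change to that artifact with meta-information. The following is a minimal Python illustration, assuming a record laid out like the hello.c.diff example. The names `ChangeRecord` and `parse_change_record` are hypothetical and do not come from any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    """One repository entry: a change to an artifact plus its meta-information."""
    removed: list = field(default_factory=list)  # lines deleted from the artifact
    added: list = field(default_factory=list)    # lines added to the artifact
    meta: dict = field(default_factory=dict)     # author, date, log message, ...


def parse_change_record(text):
    """Parse a record shaped like the hello.c.diff example (assumed layout)."""
    record = ChangeRecord()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            record.removed.append(line[2:])
        elif line.startswith("+ "):
            record.added.append(line[2:])
        elif ": " in line:
            # Meta-information lines look like "Key: value"
            key, value = line.split(": ", 1)
            record.meta[key] = value
    return record


# Raw string so that \n stays as the two characters written in the C source
SAMPLE = r'''
- printf("Hello world");
+ printf("Hello world\n");

Author: rms
Date: 20100618 04:34 UTC
Change: +1 -1
Log: Forgot to add new line
'''

record = parse_change_record(SAMPLE)
print(record.meta["Author"], len(record.added), len(record.removed))
# prints: rms 1 1
```

The artifact itself (hello.c) never appears in the record; only the change and its context do, which is what mining tools typically work with.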

2. Conferences and journals of interest

Working conferences of interest

IEEE Int. Working Conf. Mining Software Repositories (MSR)
   http://msr.uwaterloo.ca
   Deadlines: January (February for the challenge)
   Accept rate: 19% (2008), 31% (2010)
   Journal possibilities: EMSE, IEEE TSE

IEEE Int. Working Conf. Source Code Analysis & Manipulation (SCAM)
   http://www.ieee-scam.org
   Deadlines: April
   Accept rate: 26% (2007), 38% (2008), 45% (2009)
   Journal possibilities: JSS, SCP

Conferences of interest

IEEE Int. Conf. Software Maintenance (ICSM)
   http://icsm2010.upt.ro/
   Deadlines: April
   Accept rate: 21% (2007), 26% (2008), 22% (2009)
   Journal possibilities: no special issues

IEEE Int. Conf. Software Engineering (ICSE)
   http://www.sbs.co.za/ICSE2010/
   Deadlines: August, September
   Accept rate: 15% (2008), 12% (2009), 14% (2010)
   Journal possibilities: no special issues

Empirical Software Eng. & Measurement (ESEM)
   http://www.esem-conferences.org/
   Deadlines: March
   Journal possibilities: EMSE

Other interesting conferences

Working Conference on Reverse Engineering (WCRE)
   http://web.soccerlab.polymtl.ca/wcre2010/

International Conference on Predictive Models in Software Engineering (PROMISE)
   http://promisedata.org/

European Conference on Software Maintenance and Re-engineering (CSMR)
   http://www.sait.escet.urjc.es/csmr2010/

Journals of interest

IEEE Transactions on Software Engineering (TSE)
   http://www.computer.org/tse/

ACM Transactions on Software Engineering and Methodology (TOSEM)
   http://tosem.acm.org/

Empirical Software Engineering (EMSE)
   http://www.springerlink.com/content/1382-3256

Journal of Systems and Software (JSS)
   http://www.elsevier.com/locate/jss

Journal of Software Maintenance and Evolution (JSME)
   http://eu.wiley.com/WileyCDA/WileyTitle/productCd-SMR.html

Handy links

Software Engineering Conferences
   Verification, Formal Methods, Programming Lang. and Compilers, Web, Security
   http://people.engr.ncsu.edu/txie/seconferences.htm

Upcoming Software Engineering Conferences Map
   http://research.csc.ncsu.edu/ase/semap/

Trending topics

Replication of empirical studies
   The replication package

Recommendation systems

Automated Software Engineering

3. Tools for Mining Software Repositories

Mining tools

Libresoft Tools http://tools.libresoft.es/
   CVSAnaly: CVS/SVN/Git repository log parser
   MLStats: Mailman and mbox parser
   Bicho: Bugzilla and SF.net tracker parser

Software Architecture Group (SWAG), University of Waterloo
   http://www.swag.uwaterloo.ca/tools.html
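
To give an idea of what a repository log parser does, here is a minimal Python sketch. It is not how CVSAnaly or any of the tools above are implemented. It assumes text laid out like the default output of `git log --numstat`, where each changed file appears as added lines, removed lines, and path separated by tabs. The sample log, commit hashes, and e-mail addresses are made up for illustration, and `parse_log` and `churn_per_file` are hypothetical names.

```python
import re
from collections import defaultdict

# Made-up sample, shaped like `git log --numstat` output (assumption)
SAMPLE_LOG = """\
commit 1111111111111111111111111111111111111111
Author: rms <rms@example.org>
Date:   Fri Jun 18 04:34:00 2010 +0000

    Forgot to add new line

1\t1\thello.c

commit 2222222222222222222222222222222222222222
Author: alice <alice@example.org>
Date:   Thu Jun 17 10:00:00 2010 +0000

    Initial import

5\t0\thello.c
-\t-\tlogo.png
"""

# added<TAB>removed<TAB>path; "-" is used for binary files
NUMSTAT = re.compile(r"^(\d+|-)\t(\d+|-)\t(.+)$")


def parse_log(text):
    """Return a list of commits, each with id, author and per-file line counts."""
    commits = []
    current = None
    for line in text.splitlines():
        if line.startswith("commit "):
            current = {"id": line.split()[1], "author": None, "files": []}
            commits.append(current)
        elif current is None:
            continue
        elif line.startswith("Author: "):
            current["author"] = line[len("Author: "):].split(" <")[0]
        else:
            match = NUMSTAT.match(line)
            if match:
                added, removed, path = match.groups()
                # Binary files carry no line counts; treat them as zero
                added = int(added) if added != "-" else 0
                removed = int(removed) if removed != "-" else 0
                current["files"].append((path, added, removed))
    return commits


def churn_per_file(commits):
    """Sum added plus removed lines per file across all commits."""
    churn = defaultdict(int)
    for commit in commits:
        for path, added, removed in commit["files"]:
            churn[path] += added + removed
    return dict(churn)


commits = parse_log(SAMPLE_LOG)
print(churn_per_file(commits))
# prints: {'hello.c': 7, 'logo.png': 0}
```

Real tools also have to deal with renames, branches, several version control systems, and storing the results in a database, all of which this sketch ignores.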


4. Datasets for Mining Software Repositories

MSR Mining Challenge

Mirrors of the version archives and bug databases for Mozilla Firefox and Eclipse
   http://msr.uwaterloo.ca/msr2008/challenge/

Repository logs of more than 500 GNOME projects, XML dump of the bug databases, and the complete SVN repositories of 69 GNOME projects
   http://msr.uwaterloo.ca/msr2009/challenge/

Ultimate Debian Database

Database with information about packages and bug reports of Debian and Ubuntu
   http://udd.debian.org/

Eclipse bug database

Saarland University
Datasheets, databases, and scripts with information about Eclipse bug reports for several releases
   http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/

FLOSSMetrics

Databases about ~5000 open source projects
   Version control repositories, mailing list archives, bug tracking databases
   MySQL dumps
   Not very user friendly
   Obtained using the Libresoft Tools
   http://www.flossmetrics.org/
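
Once a database dump is loaded, most analyses start with simple aggregate queries. The Python sketch below shows the kind of question one might ask, here commits per author. It makes several assumptions: the `commits` table and its columns are hypothetical and are not the actual FLOSSMetrics schema, the rows are made up, and SQLite from the Python standard library stands in for MySQL so that the example runs without a database server.

```python
import sqlite3

# In-memory SQLite database with a hypothetical, simplified schema
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (id INTEGER PRIMARY KEY, author TEXT, date TEXT);
INSERT INTO commits (author, date) VALUES
    ('rms',   '2010-06-18'),
    ('rms',   '2010-06-19'),
    ('alice', '2010-06-19');
""")


def commits_per_author(connection):
    """Return (author, number of commits) pairs, most active author first."""
    cursor = connection.execute(
        "SELECT author, COUNT(*) FROM commits "
        "GROUP BY author ORDER BY COUNT(*) DESC, author"
    )
    return cursor.fetchall()


print(commits_per_author(conn))
# prints: [('rms', 2), ('alice', 1)]
```

Against a real dump, the same query would run through a MySQL client library and would need to be adapted to the table and column names found in the dump.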


FLOSSMole

Database with information about all the SourceForge.net projects
   ~150,000 projects
   Mainly meta-information, obtained by parsing the web pages of the projects
   No low-level or fine-grained information
   http://flossmole.org

PROMISE repository

All PROMISE papers must be submitted with a package containing the data used in the paper
   http://promisedata.org/

101 datasets
   Defect prediction (58)
   Effort prediction (18)
   General (9)
   Model-based SE (7)
   Text mining (9)