(Scientific) Python environments with Anaconda

Problem: You have a PC or an SSH account on a computation server and you want to use the latest and greatest Python stack for your work. The main problem is that Python has a lot of external libraries which aren’t generally packaged by distro maintainers. One way to overcome this is to use pip, a command-line utility that fetches Python packages from PyPI (the Python Package Index) and installs them globally (system-wide) or locally (user-wide). Although the pip/PyPI duo seems to remove the dependency on distro packages, it has serious problems: dependency management and keeping your packages up to date can get messy, and libraries may not be packaged carefully enough to work out of the box in a scientific computing environment. Also, pip itself has no concept of isolated environments.

Today we’ll be talking about another solution called Anaconda, an advanced scientific Python distribution developed by Continuum Analytics.

What is Anaconda?

Anaconda is essentially a curated Python package repository together with command-line utilities to manage custom Python environments. By using Anaconda, you can:

  • Install everything (literally everything) related to Python in your HOME folder
  • Keep your packages up to date easily
  • Create and manage several isolated environments in your HOME folder. A common example would be to have environments for Python 2 and Python 3.
  • Take advantage of many other great features if you’d like

Getting Started

There are two variants, Anaconda and Miniconda:

  • Anaconda ships the majority of the packages (a total of 195 packages as of June 2015) in the installer, which is ~300 MB.
  • Miniconda only ships the bare minimum and it is up to you to install the packages that you’d like to use afterwards. I always prefer this one.

So let’s head over to the Miniconda download page. There you’ll see Python 2.x and Python 3.x links. The difference between them is that Miniconda3 creates Python 3.4 environments by default. Note that this doesn’t mean you can’t create a Python 2.7 environment with Miniconda3; only their defaults differ. Since I’m still not familiar at all with Python 3.x, I’m downloading the Python 2.7 64-bit bash installer for my laptop, which is running Ubuntu 14.04 64-bit.

Installation

Run the installer script from a terminal:

ozan@ozan-Lenovo:~$ bash Miniconda-latest-Linux-x86_64.sh
Welcome to Miniconda 3.10.1 (by Continuum Analytics, Inc.)

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>

The rest of the installation is easy: accept the license and the default installation location ($HOME/miniconda). Once the packages are installed, the installer will ask whether to prepend the Miniconda install location to your PATH or not. Saying yes to this (recommended) will automatically make your shell conda-aware. You can always do this afterwards by editing your .bashrc file. Now open a new terminal screen and type conda; if everything went well, you’ll see the help page of the conda command.

The name of the default environment is root. Let’s list the available conda environments:

ozan@ozan-Lenovo:~$ conda env list
# conda environments:
#
root * /home/ozan/testconda

Good! Now the last step is to activate the environment:

ozan@ozan-Lenovo:~$ source activate root
discarding /home/ozan/testconda/bin from PATH
prepending /home/ozan/testconda/bin to PATH

source is a shell built-in which applies a shell script to the current shell, i.e. the changes that the script makes remain active in your shell. activate is a script installed by Miniconda ($HOME/miniconda/bin/activate). When you run the above command, conda sets up the Python-related paths and variables for the running shell. From now on, you will be accessing the interpreter and packages installed in your local conda environment instead of the system-wide ones. This is exactly what we were trying to achieve: you have total control over a Python stack installed locally inside your $HOME folder.
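To quickly sanity-check which interpreter your shell resolves to now, you can ask Python itself:

import sys

# Should point inside your conda environment, e.g. /home/ozan/miniconda/bin/python
print(sys.executable)
print(sys.version)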

I highly recommend adding the activation command to your .bashrc as well, so that firing up a new shell automatically puts you in your conda environment. The end of your .bashrc should look like this:

export PATH=/home/ozan/miniconda/bin:$PATH
source activate root

Let’s launch the Python interpreter to see the results:

ozan@ozan-Lenovo:~$ python
Python 2.7.9 |Continuum Analytics, Inc.| (default, Apr 14 2015, 12:54:25) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
>>>

If you see Continuum Analytics next to the Python version then everything is OK 🙂

Upgrade and Install

The first thing to do is to upgrade your environment so that new packages are fetched and installed from conda repositories:

ozan@ozan-Lenovo:~$ conda update --all

Just type yes and sit back. Once the update completes, you have the latest and greatest bare-minimum Python stack at your disposal. Now let’s install ipython:

ozan@ozan-Lenovo:~$ conda install ipython
Fetching package metadata: ....
Solving package specifications: .
Package plan for installation in environment /home/ozan/testconda:
The following packages will be downloaded:
 package                    |            build
 ---------------------------|-----------------
 ipython-3.2.0              |           py27_0 3.4 MB

The following NEW packages will be INSTALLED:
ipython: 3.2.0-py27_0
Proceed ([y]/n)? y

ozan@ozan-Lenovo:~$ ipython
Python 2.7.10 |Continuum Analytics, Inc.| (default, May 28 2015, 17:02:03) 
Type "copyright", "credits" or "license" for more information.

IPython 3.2.0 -- An enhanced Interactive Python.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

Easy 🙂

Creating a new environment

There are several advanced features of conda regarding environments, but here I will only show an example of creating a Python 3.4 environment named py3 and installing ipython inside it:

ozan@ozan-Lenovo:~$ conda create -n py3 python=3.4 ipython

That’s it! Once the installation completes, you can switch to your new environment using the source activate <env name> trick:

ozan@ozan-Lenovo:~$ source activate py3
discarding /home/ozan/testconda/bin from PATH
prepending /home/ozan/testconda/envs/py3/bin to PATH

ozan@ozan-Lenovo:~$ ipython
Python 3.4.3 |Continuum Analytics, Inc.| (default, Jun 4 2015, 15:29:08) 
Type "copyright", "credits" or "license" for more information.

IPython 3.2.0 -- An enhanced Interactive Python.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org

Running source deactivate will switch you back to your default root environment.

(Note that since we activated the root environment by default in .bashrc, launching a new shell will always bring you back to your Python 2.x environment.)

Where is Science?

Python by default is slow for scientific computations due to numerous design choices. NumPy (Numerical Python) overcomes this limitation and enriches Python with Matlab-like facilities. You can easily install NumPy by typing conda install numpy, but before doing so let’s talk about some technical details.
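(As a quick taste of those Matlab-like facilities before diving in, a tiny vectorized computation looks like this; the whole array is processed by compiled loops instead of the Python interpreter:)

import numpy as np

x = np.linspace(0, 2 * np.pi, 1000000)  # a million sample points
y = np.sin(x) ** 2 + np.cos(x) ** 2     # elementwise, no explicit loop

print(y.mean())  # ~1.0, since sin^2 + cos^2 = 1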

BLAS (Basic Linear Algebra Subprograms)

BLAS (Basic Linear Algebra Subprograms) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto API for linear algebra libraries, with bindings for both C and Fortran.

Wikipedia

Any numerically expensive computation is guaranteed to rely on optimized BLAS libraries in the background. Linux distributions (at least Fedora and Ubuntu) ship with ATLAS, but you can find some other BLAS implementations in their repositories too. Intel has a proprietary BLAS implementation called MKL (Math Kernel Library) which is aggressively optimized for Intel CPUs. If you are a student, you can get an academic license from Intel and download MKL free of charge.

Nowadays, another implementation called OpenBLAS is highly recommended. OpenBLAS and MKL are generally quite close in terms of performance. Note that for the best performance it is better to download and compile OpenBLAS on your own computer instead of relying on distro packages.

So now you may ask: “Which BLAS implementation does the installed NumPy use?” It is ATLAS, but since conda creates a brand new environment in your $HOME folder, it can’t rely on the distribution-provided ATLAS, which may not be installed at all. That’s why NumPy is statically linked against ATLAS at compile time.
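If you want to verify this on your own installation, NumPy can report what it was built and linked against (the exact output differs between builds):

import numpy as np

# Prints the BLAS/LAPACK configuration NumPy was compiled with
np.show_config()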

Now let’s try a simple dot product between large matrices using the NumPy installed from conda. Run the snippet below in ipython (%timeit is a special ipython magic which measures the elapsed time of an operation):

import numpy as np
a = np.random.random_sample((1000, 1000))
b = np.random.random_sample((1000, 1000))
c = np.random.random_sample((2000, 2000))
d = np.random.random_sample((2000, 2000))
%timeit np.dot(a,b)
10 loops, best of 3: 74.8 ms per loop
%timeit np.dot(c,d)
1  loops, best of 3: 1.04 s per loop

On my three-year-old Core i5-2467M 1.6GHz laptop (2 physical cores), the dot product of two 1000×1000 matrices took ~75 ms while the same operation on 2000×2000 matrices took ~1 s.

(Note: Although BLAS implementations greatly benefit from multi-core CPUs, Hyper-Threading (an extra execution thread per physical core) doesn’t help at all. That is why, on a dual-core CPU for example, you’ll generally see only 2 of the 4 logical CPUs under heavy use during a BLAS operation. Don’t panic, this is expected.)
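(If you want to experiment with the thread count yourself: OpenBLAS and MKL honor environment variables such as OPENBLAS_NUM_THREADS and MKL_NUM_THREADS, while ATLAS fixes its thread count at compile time. A small sketch; the variables must be set before NumPy loads the BLAS library:)

import os

# Must be set before importing NumPy, i.e. before BLAS is loaded
os.environ["MKL_NUM_THREADS"] = "2"       # respected by MKL builds
os.environ["OPENBLAS_NUM_THREADS"] = "2"  # respected by OpenBLAS builds

import numpy as np

a = np.random.random_sample((2000, 2000))
a.dot(a)  # should now use at most 2 threads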

(Update: As of February 2016, Anaconda started to ship with MKL without a need of an academic license.)

Conda Accelerate

Accelerate is an Anaconda Add-On that allows Anaconda to take advantage of modern multi-core and GPU architectures. Accelerate optimizes common operations like linear algebra, random-number generation, and Fourier transforms in standard Anaconda packages. Accelerate also contains revolutionary programming patterns in NumbaPro to enable Python developers to easily compile their own Python code to many-core, and GPU architectures.

Although Conda Accelerate is a paid add-on, it is free for academic use. So if you apply for an academic license and put the license file in your $HOME/.continuum folder, conda install accelerate will install the Accelerate add-on, which automatically switches several of your packages (NumPy, SciPy, etc.) to MKL-enabled builds. Accelerate also installs CUDA and other GPU-related goodies, which is fantastic if you are lucky enough to have a CUDA-capable GPU.

Now let’s try the dot product benchmark again with MKL-enabled NumPy:

%timeit np.dot(a,b)
10 loops, best of 3: 75.1 ms per loop
%timeit np.dot(c,d)
1  loops, best of 3: 585 ms per loop

For the larger matrices, MKL-enabled NumPy performed almost 2x better than the standard ATLAS build.

Numba & NumbaPro

Finally, if you’re really into optimizing the performance of your scientific code, you can try Numba, a JIT (just-in-time) compiler for Python. (NumbaPro is the pro version shipped within Accelerate.) Numba uses a compiler framework called LLVM to dynamically generate native code for decorated Python snippets.
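As a minimal illustration with the open-source Numba (a sketch of the basic usage; NumbaPro adds GPU targets on top of this), decorating a numeric function is often all it takes, and the first call triggers compilation:

import numpy as np
from numba import jit

@jit(nopython=True)  # compile to native code, no interpreter fallback
def arr_sum(arr):
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i]
    return total

x = np.random.random_sample(10000000)
print(arr_sum(x))  # first call compiles, later calls run at native speed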

Conclusion

  • Always use a solution like Anaconda to ease your Python installations. Alternatives include Canopy and Python(x,y), but I strongly recommend Anaconda, as the team behind it not only does packaging but also develops innovative products, libraries, add-ons, etc. Take a look at their website and blogs.
  • Always keep in mind that the bare default installation will generally be suboptimal in terms of BLAS performance. Use MKL or OpenBLAS where possible. On the GPU side, NVIDIA has a BLAS implementation called cuBLAS.
  • If you are diving into machine learning, especially deep learning, try hard to get some GPUs, as CPU-only deep learning is very time-consuming.
  • Take a look at other languages and frameworks. For deep learning, the currently popular options are:
    • Theano (GPU/CPU symbolic expression compiler for Python)
    • Pylearn2 (Neural network framework based on Theano)
    • Lasagne (Neural network framework based on Theano)
    • Keras (Neural network framework based on Theano)
    • Neon (A highly GPU optimized neural network framework in Python)
    • Torch (Neural network framework in Lua mainly developed by Facebook AI)
    • Caffe (Berkeley’s C++/Python CPU/GPU neural network toolbox)
    • cuDNN (NVIDIA’s CUDA optimized deep neural network library)

Google Photos – I (While Migrating)

(Photo: IMG_20130630_182018)

(Foreword: As you will notice while reading, this post aims not only to talk about this technology but also to put curious readers in touch with the Linux command line.)

The new Google Photos, released by Google last week, attracted quite a lot of attention with its advanced features. Rather than letting the photos I took myself, and the ones I collected here and there back in the ICQ and MSN days, sit scattered in pieces on my hard disk, I said: it is high time for some spring cleaning!

First Contact

I first met the mobile version of the app. It felt a bit confusing. I have this habit of not being able to start using an application before getting a grip on every corner of it, so I just eyed it from a distance for a while. I didn’t touch my photos much and immediately turned off automatic synchronization, backup, and whatever else there was. First I asked it to back up the camera folder and the Instagram folder on the phone, since I have quite a lot of photos on Instagram as well and, thankfully, the Instagram app keeps the processed photos on the device.

Meanwhile, some DSLR photos I had enthusiastically uploaded to Picasa back in the day were automatically migrated to the new Google Photos as well. That motivated me a bit, because, darn it, in the mobile app the memories were practically glowing in HD quality 🙂

Storage Options

High quality: unlimited storage for photos under 16 megapixels. I believe lossy compression is also applied on the server side.

Original size: the option that uses your Google account’s default 15 GB of storage (shared with Gmail, Drive, and other services) and leaves the photos as they are. A perfect fit for DSLR owners. If 15 GB is not enough for you, the price is $2 per month for 100 GB.

How does it work?

A bit confusing, if you ask me :/ First you choose which folders on your phone should be backed up. Then, according to your synchronization settings (3G / Wi-Fi / only-while-charging options are available), syncing and backup to the cloud takes place. Unless you create an album, all photos appear on the main screen (All photos). As far as I’ve learned and understood, albums are manually created sets of photos.

And I sat down at the computer…

On the disk of my desktop computer at home there are some photos I copied over during a recent phone change/reset. The first folder I came across was copied from the phone on December 14, 2013 and contains 893 photos. First I dragged one of them into the Google Photos web interface, but unfortunately it could not extract the capture date of the photo. This information matters; otherwise the photos will not be sorted by capture date in albums.

Various details about the moment a photo is taken are stored in a data structure called EXIF, kept in the metadata fields of photo files. The EXIF standard can store a wide variety of data such as the capture date, shutter speed, aperture, ISO value, white balance, and whether the flash was on or off. Photos taken on the phone also carry this information, but the Instagram service most likely strips it.
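(By the way, if you want to check from Python whether a given file still carries EXIF data, the Pillow imaging library offers a quick way; a minimal sketch, assuming Pillow is installed:)

from PIL import Image
from PIL.ExifTags import TAGS

# _getexif() returns None when the JPEG carries no EXIF block at all
exif = Image.open("IMG_20130630_182018.jpg")._getexif() or {}
for tag_id, value in exif.items():
    # 0x9003 (36867) is DateTimeOriginal, the tag we will inject below
    print(TAGS.get(tag_id, tag_id), value)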

Since the files’ creation dates and names still reflect the capture date, the first thing that came to mind was to inject this information into the files’ EXIF fields. Let’s beam over to the command line:

$ exif -cl | grep Date
0x001d GPS Date - - - - -
0x0132 Date and Time * - - - -
0x9003 Date and Time (Original) - - - - -
0x9004 Date and Time (Digitized) - - - - -

When given the -cl parameter, the exif tool lists the names and IDs of all EXIF tags. Grepping for “Date” among them turns up 4 tags. My guess is that by setting the EXIF tag numbered 0x9003 we will have embedded the date information into the file. After some wandering through man pages and Google, I write the following shell script:

#!/bin/bash
# Usage: ./fix_exif.sh <photo dir>
# Creates EXIF-enriched copies of IMG_YYYYMMDD_HHMMSS.jpg files under <photo dir>/exified
PHOTO_DIR=$1
cd "$PHOTO_DIR" || exit 1
mkdir -p exified
for PHOTO in IMG*jpg; do
 echo "Processing $PHOTO..."
 # Turn IMG_20131214_174530.jpg into "2013:12:14 17:45:30" (the EXIF date format)
 DATE_TAKEN=$(echo "$PHOTO" | sed -r "s/^IMG_([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})([0-9]{2})\.jpg$/\1:\2:\3 \4:\5:\6/")
 exif -c "$PHOTO" --ifd=EXIF --tag=0x9003 --set-value="${DATE_TAKEN}" --output="exified/$PHOTO"
done

When I pass a directory to the script above as a parameter, it creates a folder named exified under that directory and drops the new photos, with EXIF date/time data added, into it. Let’s run it:

$ cd Instagram-14-12-2013
$ chmod +x fix_exif.sh
$ time ./fix_exif.sh .
...
...
real 0m27.095s
user 0m1.203s
sys 0m4.996s

The script processed 893 photos in 27 seconds, creating copies with EXIF data under exified. Now when I drag a photo from this folder into Photos, it lands at the correct date on the main screen where all the photos live 🙂

Tools for scraping election results

I wrote two small Python scripts that will help people who want to analyze the results of the election held in Turkey on March 30.

These scripts fetch the ballot box URLs for a specific state or city and convert the results to CSV file format.

I hope this will help people uncover inconsistencies and fraud patterns in the data.

https://github.com/direnkod/secim-araclari

Ubuntu 13.10 for BeagleBone Black – Setting up zeroconf (Part II)

One very useful thing on embedded systems is setting up a zeroconf hostname so that you can SSH into your BBB without actually knowing its IP address, which may change from boot to boot. So how does this work?

  1. Install avahi-daemon on BBB:
    $ sudo apt-get install avahi-daemon

    After installing it, Ubuntu automatically starts it and enables it so that it gets launched on every boot.

  2. Change the hostname of BBB (default was arm for this Ubuntu image) to something meaningful like beaglebone by editing the file /etc/hostname. You can now reboot your BBB and it will be accessible on the network as beaglebone.local!
  3. Make sure you have nss-mdns installed on your host computer and that mDNS host lookup is enabled in /etc/nsswitch.conf. On Fedora, you have to install the package called nss-mdns from the repositories and change the hosts line in your /etc/nsswitch.conf file to look like below (we’ll verify the setup right after this list):
    hosts:      files myhostname mdns_minimal [NOTFOUND=return] dns
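Before moving on, you can quickly verify that mDNS lookup works on the host; any resolver call goes through nsswitch, so a small Python check (a sketch, using the hostname from step 2) is enough:

import socket

# Resolves through the system resolver, which now consults mdns_minimal
for family, _, _, _, sockaddr in socket.getaddrinfo("beaglebone.local", 22):
    print(sockaddr[0])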

Now you should finally be able to ssh into your BBB with:

$ ssh ubuntu@beaglebone.local

Happy hacking!

Ubuntu 13.10 for BeagleBone Black (Part I)

I have been playing with the BeagleBone Black (BBB) for a week and I am quite happy with it, but I noticed that the preloaded Angstrom distribution has some major limitations/issues for me:

  1. The package management tool opkg constantly fails to fetch package updates.
  2. SciPy, BLAS, and ATLAS do not exist in the Angstrom repositories. If you intend to build them manually from source, you’ll see that you need a Fortran compiler, which is unfortunately not provided by Angstrom.
  3. Many tools are actually stripped-down busybox applets.
  4. git was sometimes failing to clone GitHub repositories because of SSL errors.
  5. pip was sometimes stalling while downloading Python packages because of SSL errors/timeouts.
  6. No man pages.
  7. and so on…

So I decided to switch to Ubuntu 13.10. The bad news is that the installer scripts floating around do not currently support flashing the onboard eMMC directly from your host computer, so I bought a Toshiba Class 10 8GB microSD card to install Ubuntu on the microSD card instead. Here are the steps:

  1. Put your microSD card into a microSD-SD adapter if your computer does not have a microSD reader. Plug the adapter into your computer directly or through an SD-USB adapter.
  2. Download the latest 13.10 saucy rootfs from Robert C. Nelson’s website: https://rcn-ee.net/deb/rootfs/saucy/
  3. Unpack the tarball and cd into it:
    $ tar xvf ubuntu-13.10-console-armhf-2013-11-09.tar.xz
    ./ubuntu-13.10-console-armhf-2013-11-09/
    ./ubuntu-13.10-console-armhf-2013-11-09/user_password.list
    ./ubuntu-13.10-console-armhf-2013-11-09/initrd.img-3.12.0-armv7-x7
    ...
    $ cd ubuntu-13.10-console-armhf-2013-11-09
  4. Now you have to find out the device name of the microSD card. There is a script in the folder called setup_sdcard.sh; launch it with the --probe-mmc parameter. Since I don’t remember the exact output of the command, I replaced the byte and sector counts with dots, but the point is that you have to determine which of the listed storage devices corresponds to your microSD card:
    $ sudo ./setup_sdcard.sh --probe-mmc
    Are you sure? I Don't see [/dev/idontknow], here is what I do see...
    
    fdisk -l:
    Disk /dev/sda: 500.1 GB, 500107862016 bytes, 976773168 sectors
    Disk /dev/sdb:   7,9 GB, ....         bytes, ....      sectors
    
    lsblk:
    NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sda      8:0    0 465,8G  0 disk 
    ├─sda1   8:1    0 156,9M  0 part 
    ├─sda2   8:2    0   750M  0 part 
    └─sda3   8:3    0 464,9G  0 part 
    sdb      8:16   0   7,9G  0 disk 
    ├─sdb1   8:17   0   7,8G  0 part
  5. Since I don’t have any storage with ~8GB capacity apart from the microSD, I understand that the card has been assigned the block device sdb. Since the card was formatted, it also had a primary partition sdb1, but we don’t care about that. The point is that our microSD can be accessed through the device node /dev/sdb.
  6. Now we’re really running the script to prepare the microSD card. Note that if your host distribution doesn’t have the necessary tools installed (like git, uboot-tools, dosfstools, etc.), the script will quit and tell you how to install the missing packages:
    $ sudo ./setup_sdcard.sh --mmc /dev/sdb --uboot bone
  7. The above script will take some time to complete. (Note that the --uboot parameter defines the target board. You can run setup_sdcard.sh with --help to see other options and parameters.)
    After the script terminates, you can unplug your microSD card, insert it into your BBB, and power on the device to boot into Ubuntu! (As far as I understand, if the microSD card is bootable, the bootloader gives it priority and you end up with Ubuntu running without any other intervention.)

If you power the BBB using the USB client port, you can ssh into BBB using the local ethernet interface which has the IP 192.168.7.2 assigned to it (I am not sure if this is a static/permanent assignment).

The default username@password for the image is: ubuntu@temppwd

Enjoy it!

GSU student accepted for outreach program for women

One of our students, Tülin İzer, has been accepted to the international Outreach Program for Women for Summer 2013.

Quoting from LWN (Linux Weekly News):

Tülin İzer is working on parallelizing the x86 boot process with mentor PJ Waskiewicz of Intel. She is currently pursuing a bachelor’s degree in computer engineering at Galatasaray University in Istanbul, Turkey. Her application included fixes for several staging drivers.