A GitHub repository for my Rust programs

Article published on 9 November 2021

last modified on 10 November 2021

by Laurent Bloch

Arranging the modules to build an executable

Shrinking from no vanity, I have decided to put my programs in a GitHub repository. First, the programs have to be organized so that an executable binary can be built, which calls for a few corrections. Some subtleties of how crates and modules are organized still escape me, but thanks to the cooperation of fellow internet users (thank you, Stack Overflow!) I have arrived at something that works.

To build the executable I rely on Cargo. It has to be told that I use an external crate, namely simple-matrix by Nicolas Memeint, which is written as follows in the Cargo.toml file:

[package]
name = "needleman_wunsch"
version = "0.1.0"
authors = ["Laurent Bloch <lb@laurentbloch.org>"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
simple-matrix = "0.1"

The contents of the src directory are changed as follows.

Main program

// src/main.rs :

pub mod fasta_multiple_cmp;

use fasta_multiple_cmp::get_filenames;

fn main() {
    get_filenames();
}
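
As a hedged illustration of how the module paths line up, here is a single-file sketch in which inline `mod` blocks stand in for the files src/fasta_multiple_cmp.rs and src/fasta_multiple_cmp/build_sequences_matrix.rs. The function `describe` and both function bodies are placeholders invented for the example; only the module names and the `use` paths mirror the real layout.

```rust
// Single-file sketch: inline modules stand in for the files under src/.
// The function bodies are hypothetical placeholders, not the real code.

// Stands in for src/fasta_multiple_cmp.rs
pub mod fasta_multiple_cmp {
    // Stands in for src/fasta_multiple_cmp/build_sequences_matrix.rs
    pub mod build_sequences_matrix {
        pub fn describe() -> &'static str {
            "build_sequences_matrix"
        }
    }

    // Absolute path from the crate root, as in the real module.
    use crate::fasta_multiple_cmp::build_sequences_matrix::describe;

    pub fn get_filenames() -> String {
        format!("fasta_multiple_cmp -> {}", describe())
    }
}

use fasta_multiple_cmp::get_filenames;

fn main() {
    println!("{}", get_filenames());
}
```

When the modules live in separate files, `pub mod build_sequences_matrix;` (with a semicolon) tells the compiler to look for the body in a file of that name under a directory named after the parent module.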

Reading the sequences

// src/fasta_multiple_cmp.rs :

// https://linuxfr.org/forums/programmationautre/posts/rust-lire-des-donnees-de-type-i8-depuis-un-fichier
// https://www.it-swarm-fr.com/fr/file-io/quelle-est-la-maniere-de-facto-de-lire-et-decrire-des-fichiers-dans-rust-1.x/1054845808/
// https://docs.rs/simple-matrix/0.1.2/simple_matrix/

pub mod build_sequences_matrix;

use std::env;
use std::fs::File;
use std::io::{BufRead, BufReader, Lines};

use crate::fasta_multiple_cmp::build_sequences_matrix::print_seq;
use crate::fasta_multiple_cmp::build_sequences_matrix::build_matrix;

pub struct Config {
    pub query_filename: String,
    pub bank_filename: String,
    pub match_bonus: f32,
    pub gap_penalty: f32
}

impl Config {
    pub fn new(args: &[String]) -> Config {
	if args.len() < 5 {
	    panic!("pas assez d'arguments");
	}
	let query_filename = args[1].clone();
	let bank_filename = args[2].clone();
	let match_bonus: f32 = args[3].parse()
	    .expect("Ce n'est pas un nombre !");
	let gap_penalty: f32 = args[4].parse()
	    .expect("Ce n'est pas un nombre !");
	
	Config {query_filename, bank_filename, match_bonus, gap_penalty}
    }
}

pub fn get_filenames() {
    let args: Vec<String> = env::args().collect();
    let config = Config::new(&args);
    
    println!("Alignement de {} avec {} \n", config.query_filename, config.bank_filename);
    
    let f_query = fasta_open_file(config.query_filename);
    let f_bank = fasta_open_file(config.bank_filename);

    read_sequences(f_query,
		   f_bank,
		   config.match_bonus,
		   config.gap_penalty);

}

fn fasta_open_file(filename: String) -> File {
    File::open(filename).expect("Fichier non trouvé !")
}

// Reads the next record from "lines" and returns (identifier, letters).
// "count" and "ident" carry state from one call to the next: the header
// that ends one record is the identifier of the following one.
fn get_sequence<B: BufRead>(count: &mut u8, ident: &mut String, lines: &mut Lines<B>)
                            -> (String, Vec<u8>) {
    let mut sequence_nuc: Vec<u8> = vec![];

    for line in lines {
        let the_line = line.unwrap();
        if the_line.is_empty() {
            continue;
        }
        if the_line.starts_with('>') {
            if *count == 0 {
                // First header of the file: remember it and keep reading.
                *ident = the_line;
                *count = count.saturating_add(1);
            } else {
                // Header of the next record: return the sequence read so far.
                let sequence = (ident.clone(), sequence_nuc);
                println!("Numéro : {}", count);
                *ident = the_line;
                // saturating_add keeps the u8 counter from wrapping back to 0.
                *count = count.saturating_add(1);
                return sequence;
            }
        } else {
            sequence_nuc.extend(the_line.as_bytes());
        }
    }
    // End of file: return what has been accumulated (possibly nothing).
    println!("Numéro : {}", count);
    (ident.clone(), sequence_nuc)
}

fn read_sequences(f_query: File,
		  f_bank: File,
		  match_bonus: f32,
		  gap_penalty: f32) {
    let fq = BufReader::new(&f_query);
    let mut fq_iter = fq.lines();
    let mut count: u8 = 0;
    let mut ident = String::new();
    let query_sequence = get_sequence(&mut count, &mut ident, &mut fq_iter);
    print_seq(&query_sequence);

    let fb = BufReader::new(&f_bank);
    let mut fb_iter = fb.lines();
    count = 0;
    loop {
        let bank_sequence = get_sequence(&mut count, &mut ident, &mut fb_iter);
        // An empty sequence means the bank file is exhausted.
        if bank_sequence.1.is_empty() {
            break;
        }
        // print_seq(&bank_sequence);
        build_matrix(&query_sequence,
                     &bank_sequence,
                     match_bonus,
                     gap_penalty);
    }
}
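
For comparison, here is a minimal sketch of FASTA reading that uses only the standard library and can be tried on an in-memory buffer. It is not the author's function: the name `read_fasta` is invented for the example, and it reads all the records at once instead of one record per call as `get_sequence` does.

```rust
use std::io::{BufRead, Cursor};

/// Reads every FASTA record from `reader`: a header line starting with '>'
/// followed by one or more sequence lines. (Hypothetical helper.)
fn read_fasta<B: BufRead>(reader: B) -> Vec<(String, Vec<u8>)> {
    let mut records: Vec<(String, Vec<u8>)> = Vec::new();
    for line in reader.lines() {
        let line = line.expect("read error");
        let line = line.trim_end();
        if line.is_empty() {
            continue;
        }
        if line.starts_with('>') {
            // A header opens a new record.
            records.push((line.to_string(), Vec::new()));
        } else if let Some(last) = records.last_mut() {
            // Sequence lines are appended to the current record.
            last.1.extend_from_slice(line.as_bytes());
        }
        // Sequence data found before any header is ignored.
    }
    records
}

fn main() {
    let data = ">seq1 test\nACGT\nAC\n\n>seq2\nGGTT\n";
    for (ident, seq) in read_fasta(Cursor::new(data)) {
        println!("{} : {} letters", ident, seq.len());
    }
}
```

Reading everything at once is simpler but keeps the whole bank in memory; the record-by-record approach of the listing above suits large banks better.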

Building the alignment matrix, computing the scores

// src/fasta_multiple_cmp/build_sequences_matrix.rs :

// This module was inspired by Vincent Esche's Seal crate,
// but simplified and much more basic, without mmap and so on.
// For pedagogic use.

use simple_matrix::Matrix;
use std::str;
use std::char;
    
pub fn build_matrix(sequence1: &(String, Vec<u8>),
		    sequence2: &(String, Vec<u8>),
		    match_bonus: f32,
		    gap_penalty: f32) {

    let l_seq1: usize = (sequence1.1).len();
    let l_seq2: usize = (sequence2.1).len();

    println!("Longueur première séquence : {} ", l_seq1);
    println!("Longueur seconde séquence : {} ", l_seq2);
	
    let mut the_mat: Matrix::<f32> = Matrix::new(l_seq2+1, l_seq1+1);

    init_matrix(&mut the_mat, l_seq2+1, l_seq1+1, 0.0);

    nw_matrix(&mut the_mat, l_seq2+1, l_seq1+1, match_bonus, gap_penalty, &sequence1.1, &sequence2.1);

    print_ident(&sequence1);
    print_ident(&sequence2);

//	print_matrix(&the_mat, &sequence2.1, l_seq2+1, l_seq1+1);

    print_score(&the_mat, l_seq2+1, l_seq1+1);
    
}

fn nw_matrix(the_mat: &mut Matrix::<f32>,
	     lin: usize,
	     col: usize,
	     match_bonus: f32,
	     gap_penalty: f32,
	     seq1: &Vec<u8>,
	     seq2: &Vec<u8>) {
    // First row: aligning a prefix of seq1 against nothing costs one gap per letter.
    for j in 1..col {
        the_mat.set(0, j, gap_penalty * j as f32);
    }
    for i in 1..lin {
        // First column: the same for prefixes of seq2.
        the_mat.set(i, 0, gap_penalty * i as f32);
        for j in 1..col {
            // A match earns the bonus; a mismatch scores 0 (no mismatch penalty).
            let score = if seq1[j-1] == seq2[i-1] { match_bonus } else { 0.0 };
            let best = max3(the_mat.get(i-1, j-1).unwrap() + score,
                            the_mat.get(i-1, j).unwrap() + gap_penalty,
                            the_mat.get(i, j-1).unwrap() + gap_penalty);
            the_mat.set(i, j, best);
        }
    }
}

fn max3(v1: f32, v2: f32, v3: f32) -> f32 {
    v1.max(v2).max(v3)
}
    
fn init_matrix(the_mat: &mut Matrix::<f32>, lin: usize, col: usize, val: f32) {
    for i in 0..lin {
	for j in 0..col {
	    the_mat.set(i, j, val) ;
	}
    }
}

// "print_seq" displays a sequence in various formats.
pub fn print_seq(sequence: &(String, Vec<u8>)) {
    println!("Ident : {:?}", sequence.0);
    // println!("Séquence : {:?}", sequence.1);
    let sequence_str = str::from_utf8(&sequence.1).unwrap();
    println!("Séquence : {}", sequence_str);
}

fn print_vector(the_vec: &Vec<u8>) {
    let vec_str = str::from_utf8(the_vec).unwrap().to_string();
    print!("{} ", "   ");
    for c in vec_str.chars() {
	print!("{} ", c);
    }
    print!("{}", "   \n");
}
    
fn print_matrix(the_mat: &Matrix::<f32>, seq2: &Vec<u8>, lin: usize, col: usize) {
    for i in 0..lin {
        // Row label: the letter of seq2 for this row, blank for the first row.
        if i > 0 {
            print!("{} ", char::from(seq2[i-1]));
        } else {
            print!("  ");
        }
        for j in 0..col {
            print!("{} ", the_mat.get(i, j).unwrap());
        }
        println!();
    }
}

fn print_score(the_mat: &Matrix::<f32>, lin: usize, col: usize) {
    println!("Score de similarité : {} ", the_mat.get(lin-1, col-1).unwrap());
    print!("{}", "\n")
}

fn print_ident(sequence: &(String, Vec<u8>)) {
    println!("Ident : {:?}", sequence.0);
}
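
To experiment with the recurrence without the simple-matrix dependency, here is a hedged sketch that computes the same similarity score with a plain `Vec<Vec<f32>>`. The function name `nw_score` is an assumption made for the example; the scoring rules (match bonus, mismatch worth 0, linear gap penalty) are those of `nw_matrix` above.

```rust
/// Needleman-Wunsch similarity score with a match bonus, no mismatch
/// penalty and a linear gap penalty. (Hypothetical std-only variant.)
fn nw_score(seq1: &[u8], seq2: &[u8], match_bonus: f32, gap_penalty: f32) -> f32 {
    let lin = seq2.len() + 1; // one row per letter of seq2, plus row 0
    let col = seq1.len() + 1; // one column per letter of seq1, plus column 0
    let mut mat = vec![vec![0.0_f32; col]; lin];
    for j in 1..col {
        mat[0][j] = gap_penalty * j as f32;
    }
    for i in 1..lin {
        mat[i][0] = gap_penalty * i as f32;
        for j in 1..col {
            let score = if seq1[j - 1] == seq2[i - 1] { match_bonus } else { 0.0 };
            let diag = mat[i - 1][j - 1] + score;
            let up = mat[i - 1][j] + gap_penalty;
            let left = mat[i][j - 1] + gap_penalty;
            mat[i][j] = diag.max(up).max(left);
        }
    }
    // The bottom-right cell holds the score of the full alignment.
    mat[lin - 1][col - 1]
}

fn main() {
    println!("{}", nw_score(b"GATTACA", b"GATTACA", 1.0, -0.5));
}
```

With identical sequences every letter lies on the diagonal, so the score is the length of the sequence times the match bonus.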

Creating a Git repository

With Git, any directory tree that contains the source code of a project becomes a repository for that project; see this excellent online manual (in French, or in almost any other language you like). You just run the repository initialization command:

git init

This creates the .git directory (the repository's reference data). The src directory (the program code) and the target directory (executables, debugging data and so on) are not created by Git: they come from Cargo.

Next, the files of interest must be added to the repository, for example:

git add src/fasta_multiple_cmp.rs
git add src/fasta_multiple_cmp/build_sequences_matrix.rs
...

It is good, or rather indispensable, for a public repository to include a license document. For my part I chose the Apache license, less rigid than the GPL or even the LGPL, but in the end it all depends on the context in which you work and on your own views on the matter.

It is highly advisable to write an introductory document that explains what your program does and how to use it. GitHub seems to prefer that this text be written in the Markdown markup language and be named README.md; you can consult the README.md of the project concerned, which I have named fasebare.

The preceding commands schedule operations but do not carry them out; for that you must commit them, using the commit command, which will ask you to write a few words explaining what you are committing:

git commit -a

GitHub

GitHub is nowadays the fashionable source-code hosting site for free software, even though it is a commercial company, bought by Microsoft. Right from the home page you are invited to open an account; that is how it all begins. Once the account is created you can create a project. The site offers a Web interface, in other words a click-fest, but I find the command line simpler to use. For that, GitHub will ask you to create an authentication token so that you can access your repository from your local machine.

Then, to send your project to GitHub (where I created the project fasebare, empty; my user name is laubloch), from the root directory of your local repository, it is simple:

git push https://github.com/laubloch/fasebare/

Building an executable

To create an executable binary with Cargo:

cargo build --release

The binary will be here:

./target/release/needleman_wunsch

Importing the project into Framagit

Framagit is another hosting service for projects in Git format, based on the GitLab software. It is one of the many free services created by the Framasoft association, so it would have been improper for me not to publish my code there. Nothing could be simpler: as soon as I had logged in to Framagit, I was offered the option of importing projects from other hosting sites, GitHub among them. And so it was done. My schoolboy program is now doubly immortal.

README of the fasebare package

Synopsis

The crate needleman_wunsch of the fasebare package consists of two Rust modules:

* fasta_multiple_cmp provides functions to read biological sequences (DNA, RNA or proteins) from FASTA-formatted files;
* build_sequences_matrix provides functions to build an alignment matrix of two sequences and to compute their similarity score, according to the Needleman-Wunsch algorithm.

The Data directory contains test data: artificial sequences, and also real sequences extracted from the GenBank databank, for trying out the programs.

The Cargo build system builds a standalone program to be invoked from the command line. The program has been built and run only on Linux; it may well run on other operating systems.

Motivation

The first aim of this package was for the author to learn the Rust programming language, and to apply it to a domain he knows a little, bioinformatics. The author’s site gives some explanations of this approach (in French...).

These programs are intended for pedagogic use; if you use them for professional or scientific projects, it will be at your own risk.

Credits

These programs use the following crate:

* simple_matrix

To take the dependency on this crate into account, the Cargo.toml file must be:

[package]
name = "needleman_wunsch"
version = "0.1.0"
authors = ["Laurent Bloch <lb@laurentbloch.org>"]
edition = "2018"

# See more keys and their definitions at
# https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
simple-matrix = "0.1"

The author found hints and inspiration from totof2000, Unknown, Nicolas Memeint.

Principle of operations (summary)

Biologists usually work on a sequence of interest, which we will call the “query sequence”, and compare it with a batch of sequences, the “bank”, in order to select the bank sequences with the highest similarity scores.

The similarity score of two sequences is computed according to the Needleman-Wunsch algorithm. This algorithm builds an alignment matrix. One sequence has its letters placed horizontally along the top of the matrix, each letter at the top of a column. The second sequence has its letters placed vertically along the left of the matrix, each letter at the left of a row. One extra row is placed just below the top sequence, and one extra column just to the right of the left sequence; they correspond to aligning letters against nothing but gaps. Each cell of the matrix holds the best score for aligning the letters of the two sequences up to that row and column, and the bottom-right cell gives the similarity score of the whole sequences.

To fill the matrix, the program computes each score for each individual pair of letters according to one of three situations (definitions borrowed from Wikipedia):

* Match: The two letters at the current index are the same.
* Mismatch: The two letters at the current index are different.
* Gap: The best alignment involves one letter aligning to a gap in the other sequence.

So the algorithm needs two parameters to work: the value of the gap penalty, and the value of the mismatch penalty (or, alternatively, the value of the match bonus, which is the solution adopted for our program).
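
Written out as a recurrence, the filling rule used by this program can be sketched as follows. The symbols are notation chosen for this illustration, not taken from the program: $F$ is the matrix, $a$ the sequence along the top, $b$ the sequence along the left, $m$ the match bonus and $g$ the gap penalty (negative or zero).

```latex
F_{0,j} = j\,g, \qquad F_{i,0} = i\,g

F_{i,j} = \max\bigl(\, F_{i-1,j-1} + s(a_j, b_i),\;\; F_{i-1,j} + g,\;\; F_{i,j-1} + g \,\bigr)

s(x, y) =
\begin{cases}
  m & \text{if } x = y \\
  0 & \text{otherwise}
\end{cases}
```

The similarity score is the value of the bottom-right cell, $F_{|b|,|a|}$.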

Refer to the Wikipedia article for further explanations and details.

To build and invoke the program:

For the developer, the command line to build the program is (from the base directory of the project):

cargo build

Then you can invoke the program as follows:

cargo run <path to the file of the query sequence>
          <path to the file of the sequences bank>
          <value of the match bonus>
          <value of the gap penalty (negative or zero)>

For instance, with test files from this repository:

cargo run Data/seq_orchid2.fasta Data/sequences_orchid.fasta 1.0 -0.5
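
As a hedged sketch of how these four positional arguments can be turned into typed values, the following standalone example returns an error message instead of panicking when an argument is missing or is not a number. The names `Params` and `parse_params` are invented for the illustration; the program itself uses a `Config` struct that panics on bad input.

```rust
/// Hypothetical counterpart of the program's Config, built from the four
/// positional arguments; errors are returned instead of panicking.
#[derive(Debug, PartialEq)]
struct Params {
    query_filename: String,
    bank_filename: String,
    match_bonus: f32,
    gap_penalty: f32,
}

fn parse_params(args: &[String]) -> Result<Params, String> {
    // args[0] is the program name, so four arguments mean five entries.
    if args.len() < 5 {
        return Err("usage: <query file> <bank file> <match bonus> <gap penalty>".to_string());
    }
    let match_bonus: f32 = args[3]
        .parse()
        .map_err(|_| format!("not a number: {}", args[3]))?;
    let gap_penalty: f32 = args[4]
        .parse()
        .map_err(|_| format!("not a number: {}", args[4]))?;
    Ok(Params {
        query_filename: args[1].clone(),
        bank_filename: args[2].clone(),
        match_bonus,
        gap_penalty,
    })
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    match parse_params(&args) {
        Ok(params) => println!("{:?}", params),
        Err(message) => eprintln!("{}", message),
    }
}
```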

To build an executable binary file, proceed as follows:

cargo build --release

The executable file will be here:

./target/release/needleman_wunsch

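Remember that a Rust program needs no separate runtime or interpreter, so this executable can be copied and run with your data on any machine with a compatible operating system and architecture.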
