Welcome to lcep’s documentation!¶
lcep¶
Classifying cancerous liver samples from gene expression data.
Free software: MIT
Documentation: https://lcep.readthedocs.io.
Features¶
Fully deterministic machine learning model based on XGBoost and MLflow using the mlf-core framework
Classify cancerous and healthy tissue samples from gene expression data
Credits¶
This package was created with mlf-core using cookiecutter.
Usage¶
Setup¶
mlf-core based mlflow projects require either Conda or Docker to be installed. Docker is strongly preferred, since it ensures that system-intelligence can fetch all required and accessible hardware. This cannot be guaranteed on macOS, let alone on Windows.
Conda¶
There is no further setup required besides having Conda installed and CUDA configured for GPU support. mlflow will create a new environment for every run.
Docker¶
If you use Docker, you should not need to build the Docker container manually, since it should be available on GitHub Packages or another registry. However, if you want to build it manually, e.g. for development purposes, ensure that its name matches the name defined in the ``MLproject`` file. This is sufficient to train on the CPU. If you want to train using the GPU, you need to have the NVIDIA Container Toolkit installed.
Training¶
Training on the CPU¶
Set your desired environment in the ``MLproject`` file. Start training using ``mlflow run .``.
You need to disable CUDA to train on the CPU! See parameters.
Training using GPUs¶
Conda environments will automatically use the GPU if available.
Docker requires the accessible GPUs to be passed as runtime parameters. To train using all GPUs, run ``mlflow run . -A gpus=all``. You can replace ``all`` with specific GPU ids (e.g. ``0``) if desired.
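The CPU and GPU invocations above can also be assembled programmatically, for example when launching runs from a script. The following is a minimal sketch and not part of lcep itself: the helper name ``build_mlflow_command`` is hypothetical, and it assumes the ``-P name=value`` syntax of ``mlflow run`` for project parameters alongside the ``-A gpus=...`` Docker argument shown above.

```python
import shlex
from typing import List, Optional


def build_mlflow_command(cuda: bool = True, gpus: Optional[str] = None) -> List[str]:
    """Assemble an ``mlflow run`` command line for this project.

    Hypothetical helper for illustration only.
    cuda: value passed to the project's ``cuda`` parameter.
    gpus: Docker GPU selection such as "all" or "0"; None omits the flag
          (for example for Conda environments or CPU-only training).
    """
    command = ["mlflow", "run", "."]
    if gpus is not None:
        # Docker needs the accessible GPUs as a runtime argument.
        command += ["-A", f"gpus={gpus}"]
    # The cuda parameter is documented as a string ('True' or 'False').
    command += ["-P", f"cuda={cuda}"]
    return command


cpu_command = build_mlflow_command(cuda=False)
gpu_command = build_mlflow_command(cuda=True, gpus="all")
print(shlex.join(cpu_command))
print(shlex.join(gpu_command))
```

To execute such a command, the list can be handed to ``subprocess.run(command, check=True)``.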
Parameters¶
``training-data``: Path to the training data CSV file ['train.csv': string]
``test-data``: Path to the test data CSV file ['test.csv': string]
``cuda``: Whether to train with CUDA support (=GPU) ['True': string]
``max_epochs``: Number of epochs to train [1000: int]
``general-seed``: Python, Random, Numpy seed [0: int]
``xgboost-seed``: XGBoost specific seed [0: int]
``single-precision-histogram``: Whether to enable single precision for histogram building ['True': string]
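Because the boolean parameters are passed as strings, a training script has to convert them before use. The following is a minimal sketch of how the parameters listed above could be parsed with ``argparse``. It is not taken from the lcep source, and the helper names ``str_to_bool`` and ``parse_parameters`` are hypothetical.

```python
import argparse
import random
from typing import List, Optional


def str_to_bool(value: str) -> bool:
    """Convert string flags such as 'True' or 'False' into real booleans."""
    normalized = value.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean string, got {value!r}")


def parse_parameters(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the documented parameters with their documented defaults."""
    parser = argparse.ArgumentParser(description="lcep training parameters (sketch)")
    parser.add_argument("--training-data", type=str, default="train.csv")
    parser.add_argument("--test-data", type=str, default="test.csv")
    parser.add_argument("--cuda", type=str_to_bool, default=True)
    parser.add_argument("--max_epochs", type=int, default=1000)
    parser.add_argument("--general-seed", type=int, default=0)
    parser.add_argument("--xgboost-seed", type=int, default=0)
    parser.add_argument("--single-precision-histogram", type=str_to_bool, default=True)
    return parser.parse_args(argv)


args = parse_parameters(["--cuda", "False", "--max_epochs", "10"])
# Seed Python's random module with the general seed for reproducibility.
random.seed(args.general_seed)
print(args.cuda, args.max_epochs, args.training_data)
```

Note that ``argparse`` exposes hyphenated names with underscores, so ``--general-seed`` becomes ``args.general_seed``.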
Model¶
Overview¶
The trained model classifies samples derived from gene expression data as cancerous or healthy.
Training and test data¶
Patients with cancer and without cancer were sequenced (RNA-Seq). All samples of the patients were assigned ‘cancerous’ or ‘healthy’. The RNA-Seq experiments generated reads, which are commonly associated with expression values per gene. These expression values were normalized into transcripts per million (TPM) values. Next, all genes were subject to pathway analysis. Any genes not present in any cancer associated pathway were discarded. Finally, the whole dataset was split into 75% training and 25% test data. lcep was trained with the training data and evaluated using the test data.
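The TPM normalization and the 75%/25% split can be illustrated with a small, self-contained sketch. It uses the standard TPM definition (read counts divided by gene length, then rescaled so that each sample sums to one million) and a seeded shuffle. The counts, gene lengths, sample names, and function names are hypothetical, and the actual preprocessing may differ in its details.

```python
import random
from typing import List, Sequence, Tuple


def counts_to_tpm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """Convert per-gene read counts of one sample into TPM values."""
    # Reads per base: corrects for longer genes collecting more reads.
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Rescale so that the values of one sample sum to one million.
    return [rate / total * 1_000_000 for rate in rates]


def split_samples(samples: Sequence[str], train_fraction: float = 0.75,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle samples reproducibly and split them into train and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


tpm = counts_to_tpm(counts=[10, 20, 30], lengths_bp=[1000, 2000, 1000])
train, test = split_samples([f"sample_{i}" for i in range(8)])
print([round(value) for value in tpm])
print(len(train), len(test))
```

The first two genes end up with the same TPM value even though the second has twice the reads, because it is also twice as long.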
Model details¶
The model is based on XGBoost.
Training was conducted using a single GPU. Hence, ``gpu_hist`` is the training algorithm of choice.
Evaluation¶
The model was evaluated on the held-out test data described above, which the model did not see during training. The reported root mean squared error originates from the test data.
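For reference, the two metrics named in this documentation, root mean squared error and log loss (the ``eval_metric`` chosen for this model), can be computed from predicted probabilities as sketched below. The function names are hypothetical, and the labels and predictions are made-up values for illustration, not results of the lcep model.

```python
import math
from typing import Sequence


def root_mean_squared_error(labels: Sequence[int],
                            predictions: Sequence[float]) -> float:
    """Square root of the mean squared difference between labels and predictions."""
    squared = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def log_loss(labels: Sequence[int], predictions: Sequence[float],
             eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, predictions):
        # Clip probabilities so that log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# Made-up example: 1 = cancerous, 0 = healthy.
labels = [1, 0, 1, 0]
predictions = [0.9, 0.2, 0.6, 0.4]
print(root_mean_squared_error(labels, predictions))
print(log_loss(labels, predictions))
```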
The full training history is viewable by running the mlflow user interface inside the root directory of this project: ``mlflow ui``.
Hyperparameter selection¶
The hyperparameters of this model were selected using a grid search approach.
``single-precision-histogram`` was enabled for faster training
``subsample`` was set to 0.7
``colsample_bytree`` was set to 0.6
``learning_rate`` was set to 0.2
``max_depth`` was set to 3
``min_child_weight`` was set to 1
``eval_metric`` was set to ``logloss``
``objective`` was set to ``binary:logistic``
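Collected into a parameter dictionary, the settings above might look as follows. This is a sketch rather than the project's actual training code: the function name ``build_xgboost_params`` is hypothetical, the fallback to the ``hist`` tree method for CPU training is an assumption, and ``gpu_hist`` and ``single_precision_histogram`` are parameter names from the XGBoost 1.x series that newer releases may no longer accept.

```python
from typing import Any, Dict


def build_xgboost_params(cuda: bool = True, xgboost_seed: int = 0,
                         single_precision_histogram: bool = True) -> Dict[str, Any]:
    """Return an XGBoost parameter dictionary with the documented settings."""
    return {
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "subsample": 0.7,
        "colsample_bytree": 0.6,
        "learning_rate": 0.2,
        "max_depth": 3,
        "min_child_weight": 1,
        "single_precision_histogram": single_precision_histogram,
        "seed": xgboost_seed,
        # Assumption: fall back to the CPU histogram algorithm without CUDA.
        "tree_method": "gpu_hist" if cuda else "hist",
    }


params = build_xgboost_params(cuda=True)
print(params["tree_method"])
```

Such a dictionary would typically be passed to ``xgboost.train(params, dtrain, num_boost_round=...)`` together with a ``DMatrix`` built from the training data.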
Credits¶
Development Lead¶
Lukas Heumos <lukas.heumos@posteo.net>
Steffen Lemke <steffen.lemke@uni-tuebingen.de>
Contributors¶
None yet. Why not be the first?
Changelog¶
This project adheres to Semantic Versioning.
1.0.1 (2021-04-08)¶
Added
Fixed
Fixed floating-point variation of some samples in the train/test dataset. The new dataset was generated by the nextflow-lcep pipeline.
Dependencies
Deprecated
1.0.0 (2021-03-11)¶
Added
Added new train and test dataset based on TCGA-LIHC & GTEx (liver)
Added optimized hyperparameters to the model
Fixed
Dependencies
Deprecated
0.1.0 (2021-03-11)¶
Added
Created the project using mlf-core
Fixed
Dependencies
Deprecated
Contributor Covenant Code of Conduct¶
Our Pledge¶
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
Our Standards¶
Examples of behavior that contributes to creating a positive environment include:
Using welcoming and inclusive language
Being respectful of differing viewpoints and experiences
Gracefully accepting constructive criticism
Focusing on what is best for the community
Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
The use of sexualized language or imagery and unwelcome sexual attention or advances
Trolling, insulting/derogatory comments, and personal or political attacks
Public or private harassment
Publishing others’ private information, such as a physical or electronic address, without explicit permission
Other conduct which could reasonably be considered inappropriate in a professional setting
Our Responsibilities¶
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
Scope¶
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
Enforcement¶
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project’s leadership.
Attribution¶
This Code of Conduct is adapted from the Contributor Covenant, version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html