Summary: | Bioinformatics is an emerging field of Computer Science with a growing impact. It works in close association with Biology and Bioprocesses, helping to explain complex mechanisms and elements in both fields. The continuous and exponential development of equipment for large-scale data acquisition has created the need for methodologies to store and analyze those data, so that researchers can extract their meaning with a high degree of confidence and precision, within a viable time frame, and thereby advance their research. However, there are techniques and procedures for which the extraction of detailed information in real time is limited, either by the absence of adequate equipment or by the logistical difficulties often associated with gathering thorough information about those processes. Industrial cultivation is one such case: environmental values such as Dissolved Oxygen, pH, Temperature and others are available in real time, but information about the complex molecular constituents of the cultivation is missing and can only be obtained by off-line analysis. In industry, to minimize the risk of contamination, the number of samples collected for analysis during the fermentation process is kept to a minimum, and in most cases no samples are collected at all. As a result, real knowledge of the cultivation process is mostly limited to its initial and final states, since exact readings during the cultivation are difficult to obtain. All control decisions on the system are based on indirect evaluations, such as the rate of oxygen consumption or the pH variation. This limited knowledge may impair the reproducibility of the cultivation process, as cells are living organisms that exhibit natural variability. 
That natural variability is further enhanced by slight variations in the environmental cultivation conditions. This is crucial in the case of biopharmaceutical production, owing to the strict regulatory constraints. DNA plasmid vaccines are increasingly moving to the forefront of pharmaceutical products because of their potential advantage over viral vectors and the theoretical advantages of DNA vaccines over subunit and whole-cell vaccines. Plasmid vaccine production consists of growing, in bioreactors, bacteria such as Escherichia coli that contain the plasmid vector with engineered DNA, which is afterwards extracted. However, as previously mentioned, it is highly relevant to control the whole cultivation process, as there is still a great need for process optimization. This optimization can result in high-yield production with reduced production costs. One solution that has been proposed for this kind of control and optimization is based on computational simulation of the processes. Computer (in silico) simulations are often used to quickly test multiple scenarios without allocating specific human or material resources, such as high-cost reagents, which makes these processes more viable; the more complex work occurs in the early stages of model development. Furthermore, mathematical models may also be useful for estimating, over extended periods of time, the complex molecular constituents of a cultivation process from the real-time measurements gathered by the sensors generally used in industry. The objective of this work is to use computational methodologies to determine the behavior of a recombinant E. coli culture designed to produce plasmids for DNA vaccination. This work was performed at the Engineering Faculty of the Catholic University of Portugal and the Instituto de Medicina Molecular. 
In this work we propose the use of a Multilayer Perceptron (MLP) to monitor and understand the behavior of E. coli DH5α containing the pVAX-LacZ plasmid vector during different batch and fed-batch cultivations. The focus of this work was to study the behavior of cultivations with different initial pre-set conditions concerning the carbon source, pH and feeding strategy, and with intermediate perturbations determined experimentally. With this goal, a set of cultivations was defined as examples, allowing us to explore a wide universe of variables and to compare cultivations with similar initial conditions. MLPs belong to a larger family known as Artificial Neural Networks (ANNs). They are considered universal approximators, capable of identifying complex patterns by learning from training examples. These examples influence future state predictions. This characteristic gives MLPs great adaptability to different models, as their main limitation lies in the quality and quantity of training examples rather than in pre-determined functions and parameters. Unlike conventional modelling techniques, MLPs rely on the data rather than on theoretical assumptions. This makes it less likely that bias is introduced into the pattern recognition. Moreover, MLPs can serve as hypothesis validators. In this work we obtained model fit values (R²) that in most cases were above 0.7. These values are even more interesting when we weigh the number of variables we attempted to predict against the number of training examples we were able to produce. To achieve our goal we defined the following real-time and off-line variables. The off-line variables were the concentrations of Biomass, Plasmid, Glucose, Glycerol and Acetate. 
These variables are not quantified in real time, as a sample must be extracted from the bioreactor and subsequently analyzed. The on-line variables, acquired in real time, were: Dissolved Oxygen Concentration, pH, Stirring Rate and Feeding Rate. The technical and logistical inability to quantify each variable at exactly the same rate illustrates two fundamental issues with the basic cultivation monitoring process: the standardization of the moment at which the variables are quantified, and the determination of the next state of the cultivation. In this work, predictions were made at 1-hour intervals, using a cross-validation training methodology. This 1-hour spacing was chosen after analyzing data available from Martins (2008) and observing no significant improvement in network prediction across 15-minute, 30-minute and 60-minute intervals. Finally, we present a possible methodology for optimizing fed-batch cultivations based on Genetic Algorithms (GA). In this approach, information and parameters of the trained MLP are used to create a cultivation policy that will be applied during the industrial process. Genetic Algorithms are evolutionary algorithms based on computational adaptations of biological evolutionary theories. Our Genetic Algorithm approach is based on a chromosome representation of a decision tree designed to determine the course of experimental action according to the state of the controlled variables. These evaluations are based on the values of Glycerol, Glucose and Acetate, and according to those values a feeding rate is determined for the next time point. At an early stage, this methodology could allow the definition of a wider example space, which could then translate into a cultivation strategy closer to the optimal solution. This research work aims to answer this emerging need and to contribute to the advance of knowledge in the area, opening new paths for the further research that will naturally and desirably follow.
|
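The Genetic Algorithm idea outlined in the summary can be illustrated with a minimal sketch. It is not the implementation used in this work. It assumes a fixed-shape decision tree encoded as seven genes (three thresholds for Glucose, Glycerol and Acetate, and four candidate feeding rates), and it replaces the trained MLP with a toy function, `toy_surrogate`, whose dynamics are invented purely for illustration. All names, bounds, units and constants below are hypothetical.

```python
import random

# Hypothetical scaling bounds (illustrative only, not taken from the work).
THRESHOLD_MAX = 10.0   # upper bound for concentration thresholds
RATE_MAX = 50.0        # upper bound for the feeding rate


def decode_policy(chromosome):
    """Decode 7 genes in [0, 1] into a fixed-shape decision tree."""
    glu_t, gly_t, ace_t = (g * THRESHOLD_MAX for g in chromosome[:3])
    rates = [g * RATE_MAX for g in chromosome[3:7]]

    def feeding_rate(glucose, glycerol, acetate):
        if acetate > ace_t:      # acetate is accumulating
            return rates[0]
        if glucose > glu_t:      # glucose is still plentiful
            return rates[1]
        if glycerol > gly_t:     # glycerol is still plentiful
            return rates[2]
        return rates[3]          # both carbon sources are running low

    return feeding_rate


def toy_surrogate(state, feed):
    """Stand-in for the trained MLP: next state after one hour. NOT a real model."""
    glucose, glycerol, acetate, biomass = state
    supplied = 0.02 * feed
    uptake = min(glucose + supplied, 0.5 * biomass)
    overflow = max(0.0, uptake - 1.0)   # excess uptake is turned into acetate
    return (glucose + supplied - uptake,
            max(0.0, glycerol - 0.1 * biomass),
            max(0.0, 0.8 * acetate + overflow),
            max(0.0, biomass + 0.4 * (uptake - overflow) - 0.05 * acetate))


def fitness(chromosome, hours=12):
    """Final biomass after applying the decoded policy at 1-hour steps."""
    policy = decode_policy(chromosome)
    state = (5.0, 5.0, 0.0, 0.5)        # hypothetical initial conditions
    for _ in range(hours):
        state = toy_surrogate(state, policy(*state[:3]))
    return state[3]


def evolve(pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Plain GA: elitism, tournament selection, one-point crossover, Gaussian mutation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(7)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = population[:2]       # keep the two best unchanged
        while len(next_gen) < pop_size:
            a, b = (max(rng.sample(population, 3), key=fitness) for _ in range(2))
            cut = rng.randrange(1, 7)
            child = a[:cut] + b[cut:]
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1)))
                     if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history
```

Calling `evolve()` returns the best chromosome found and the best fitness per generation; `decode_policy(best)` then yields a function that maps the current Glucose, Glycerol and Acetate values to a feeding rate for the next time point. In a real setting the surrogate would be replaced by the trained MLP, and the tree shape itself could also be evolved rather than fixed.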