Scientific publications

February 28, 2018

Preamble: I wrote a draft of this post some months ago but did not publish it because I found it at first lengthy, a bit too ambitious in its goals and not conclusive enough to be truly relevant among the large set of articles published daily on the topic. Yet, after stepping back from this area of research, I realize that these criteria do not matter. This post summarizes some ideas of mine at a given point in time and can all the same serve as a basis for further thinking.

This is the beginning of a series of posts meant to establish an overview of the state of applied physics research. It is mostly a way to get clear ideas for myself about the good and bad aspects of this field of research, based on my experience. This experience is, by nature, limited to the area of plasma physics and to what I have read from related fields. This is an opinionated view and in no way a methodical scientific study.

It starts here with scientific publications, which are the visible part of the iceberg and attract most of the attention from a wide audience.

The researcher has three types of relations with publications: as a reader, to get information; as a reviewer, to validate the results of colleagues; and as a writer, to present his own results. The failures and advantages of the present publishing system will be detailed for each type of relation, with an attempt to draw some conclusions about what can be done, or is already being done, to improve the process. I will also highlight the issues which are direct effects of the structure of the academic system and which cannot be solved just by changing the way articles are published.

Publications as a source of information

When you start to study a topic, you have to “stand on the shoulders of giants”. Publications are these shoulders. They provide the information you need to understand the status of your area of research: what is known, what is not known, what the issues are, what is unclear. The present-day researcher has four sources of information to access this existing knowledge: books, articles, the internet and human relations.

  • The purpose of books is to cover a well-established body of knowledge: there you will find the tools, mostly theoretical, to use in your research. The problem is that books for specialists are very expensive and mostly available only on paper. Institutes have libraries where you can find them, but this is a very analog process: you have to go there, find the book, wait if it is already lent out and keep it only for a limited amount of time. When you have to browse a lot of books, it is inconvenient. Fortunately, you can now find scanned versions of most technical books on the internet. They are only PDF files, but that is better than nothing.
  • Articles present the state of the art of research: topics which are not 100% settled but come with new ideas and new data. The purpose of research is mostly to validate existing articles, to invalidate them and propose alternative solutions, or to unify several of them. This is your daily bread. All articles are now online. When you work in research in western countries, the price is invisible to you; only the administration sees what an article costs. It is not the case in other countries, and the success of Sci-Hub shows how high this price is. But, in my opinion, it is not the price in itself which is too high, it is the service/price ratio of the publication which is too low. Because the question is: what do you get in a publication? A PDF file, most of the time limited in pages, with the assurance that two or three people read it beforehand and validated it. That’s it. But you get only an assurance, not direct insight into the reviewing process: you don’t know what issues were raised or how they were answered. You don’t know whether other people found problems in the paper or confirmed its validity. Everything is completely opaque. In addition, a paper is mostly text with a few graphs and schematics. You have no direct access to the data, to the exact experimental protocols or to the codes used to get the results. Technological solutions exist to make the process more transparent and reproducible, but they are not used for this purpose. Technological tools are mostly used to streamline publications in their present form and to make their number skyrocket. It has never been so easy to publish articles. You could do it every month, and some scientists do just that, because it is better for their career. So you get overwhelmed by the amount of articles, with a very low signal-to-noise ratio. Scientific publishing has reached the age of YouTube comments. This is where the two remaining sources of information play a role.
  • The internet offers knowledge beyond articles. You can find blog posts by fellow scientists who talk more freely there about their research and where you can catch some details which were missing from the publications. You can find contributions by laymen who just have fun with technical stuff and spend a lot of time shooting videos with their GoPro camera to explain some obscure electronic construction which happens to be of utter importance for developing your experiment. The information there is not organized, not well structured, but it offers a wider range of innovative ways to expose scientific content.
  • The last source of information is your social network: your colleagues, your fellow scientists, the guy or the lady you discuss with over coffee and who will offer you his or her experience and contacts to explain something you did not understand. They represent the unstructured, informal channels of information. This is a precious source, which is both underestimated (the myth of the lone genius scientist, but this is another story) and very hard to reach. It requires competencies in networking, team building and public outreach, which are far from being a major part of scientific education.

Publications for reviewers

The second relation that a scientist can have with publications is when he is requested to review them. From my experience and that of people I know, this is most of the time a gratifying experience for two reasons: first, it is a recognition of your expertise and, second, it brings you to topics related to your area but which you would not necessarily have spent time studying. It is an opportunity for curious people. So, even if it takes time in my schedule, it is always a pleasure to review another paper. Yet, what I miss is a real discussion with the author. You have two or three passes of questions and answers and that’s it. In my opinion, the process is not iterative enough and not open enough to more people.

The counterarguments are usually twofold. First, versioning could bring instability to the system: without a definitive version of a paper, there is no solid ground, no reference which can be used to progress further. This is true, but only partly. Psychologically, it is indeed very helpful to have a finished, published paper: you have the feeling of getting something done and it can be used as a showcase for your work. Of course you know the approximations and uncertainties present in the work, but you wipe them away and get a boost of energy for the next publication. Yet, as in software, you could imagine major and minor releases, stable and unstable editions. That would perfectly fit the research process: one paper which is improved, enriched and extended following the progress of your work. The second reproach is that open review by a wide audience would lead to fewer reviews, because nobody would feel responsible for it or spend time on it. This is also only partly true. It would be necessary to have main reviewers, as you have main contributors on pieces of software, who would be in charge of the main review.
But I am pretty sure that if you open the review to the community, and you enable a rational dialogue between the authors, the main reviewers and the community, you can dramatically improve the quality of papers. There are many publications on which I would like to comment, ask questions or make suggestions, but there is no systematic way to do it. You can do that at conferences when you meet the authors, but this is informal and, most of the time, without any follow-up. The key to success is to create the proper framework and adequate tools to facilitate these processes.

Publications for authors

You have managed to get results and you want to broadcast them to your scientific community. There is only one way: to publish in a journal. You are confronted here with several issues, starting with the choice of the journal, which is mostly a compromise between the relevance of your results, the impact factors and who your co-authors are. Then, you are constrained by a format: limited number of pages, non-interactive, text and images only. This is not necessarily a disadvantage. For instance, I find Physical Review Letters absolutely awful for the reader. How useful can four pages be? It is like understanding the Syrian crisis only through a dispatch from a press agency. But for the author it is a very interesting exercise: it forces you to extract the essence of your research, to determine what makes your results interesting, and nothing else. It brings a lot of structure to the scientific thinking. Yet, much of this effort is spoiled by the time needed to format the article. No publisher has ever spent time and money improving publishing tools, plot creators and other useful editing frameworks; at best we get templates and LaTeX libraries. All interesting software comes from outside publishing. This is also true for most research institutes, which push for more publications without trying to improve the editing process itself.

The future of science publishing

There are tools of all sorts to improve the communication of science. Yet, the situation seems stalled, with disagreements and divergent interests between the stakeholders of science (researchers, institutes, publishers). As in engineering, there are two ways to design a new system of science communication. Top-down, with an initiative of the decision-makers: this can happen with a change of generation of science leaders, with people who have only known a publishing system in crisis and are aware of the problem and of the possible solutions. Or bottom-up, with the self-organization of scientists who collectively manage to agree on the standardization of a more effective way to communicate and validate scientific knowledge. Solutions will probably emerge, if they emerge, from both directions and will require time and patience. But these two elements are the most effective weapons of Science.


The art of science communication

September 20, 2016

If only science were a game between you and nature alone! Alas, it is not that simple: our environment is far too complicated to be understood by an individual. Even if the myth of the lone genius Einstein persists, the reality is that science, whatever its domain of application, is an endeavor at the scale of humanity. A problem can be addressed only through cooperation, discussions and disputes. Consequently, the talent of a scientist resides as much in his communication capabilities as in his theoretical and experimental proficiency.

I came to dig a bit deeper into this topic while reading this article highlighting the need for a simplification of scientific communication. I agree that there is a problem of communication in science, but it may not be due only to an elitist style. If we want to understand the issue better, we have to consider the different types and levels of communication that the scientist has to deal with. The frontier between the different types is rather blurry and depends on the targeted audience and the purpose of the communication. But we can distinguish the following levels.

The first level of communication is the routine communication with teammates, people working on the same topic who aim at solving the same scientific problems. It is a highly specialized discussion where the use of jargon is recommended to keep a high level of accuracy and avoid misunderstandings. The communication is in this case a mixture of equation writing, drawing, exchange of code and rational discussion. This is a difficult exercise because it is absolutely necessary to make sure that the participants in the discussion share, at the end, the same understanding of the problem and of the possible solutions. From experience, a lot of time is lost because of misunderstandings. It is also difficult because scientists often think that discussion with colleagues is a waste of time at the expense of pure individual thinking.

The second level of communication is the publication: it can be a report, an article or a digital notebook. The purpose here is to communicate in detail the method, the results, the analysis and the conclusions of the work so that your peers can try to reproduce, falsify, confirm or improve it. Therefore, it has to be clear, accurate and complete. This level is typically what is expected from a scientist. There is a lot of ongoing discussion about the problems of reproducibility, peer review and journal impact factors, but that is a slightly different story.

The third level of communication is the oral presentation. The purpose here is to attract the attention of the scientific community to your work, whether to get collaboration, help, contradiction or funding. An oral presentation is, by definition, limited in time and thus can focus only on a limited number of points; therefore it cannot address technicalities. The communication has to highlight some key ideas; it has to activate some triggers in the audience to motivate them to look at your work in more detail (through communication of the second and first levels). Honestly, given what I see during conferences, this is an exercise which is, most of the time, poorly done: slides overloaded with plots and text, no coherent structure, no context explained, no vision. I suspect that most scientists fear that they cannot use storytelling and simple slides without being criticized for a lack of rigor. There is a balance to find. A presentation, even a scientific one, has to be compelling.

The last level of communication is the communication with the public. Void. Blank. This is the ultimate difficult exercise. Hell on earth. And it has become worse in recent years. Before, the main contact with the public was through the media and the journalists, and only some chosen, distinguished scientists were allowed to talk to them. So the difficult exercise of explaining science to a broad audience fell to the journalist. Difficult, because you have to find a compromise between the accuracy of the facts and the interest of the public. We touch here the heart of the problem: the scientific method (but not its results!) is fundamentally not attractive. By definition, it is rational and not emotional, while most people expect emotion. There can only be a conflict when we want to communicate about science.

Anyway, with the development of the internet and of social networks, the separation between the public and the scientists has faded away. We are now in a position to talk face to face with the audience. And the audience expects communication with the scientists; it expects them to play a social role, even a political one when they tackle topics such as climate change or biotechnologies. This is a role for which the scientist is almost not prepared. The difficulty is even greater now that society faces a problem with facts. The exact reason for this phenomenon is unclear: the explosion of data, the increased complexity and hyper-specialization of science, degraded education. Whatever it is, people tend to pay less and less attention to facts, data and rational discourse (if you want some proof, listen to some well-known politicians; a more in-depth discussion is to be found in Rhys Taylor’s blog). So the scientist is expected to speak out, but the type of communication for which he is trained will not be heard. It can only end badly: either he shows viewgraphs on TV or he moans “trust me!” (which is the worst thing to say in science). Honestly, I still have no answer as to the behavior to adopt in this case. This is still experimental ground. But the scientist must enter this ground, communicate with the audience and find strategies to make his voice loud and clear, so that the public gets interested in science again.

The philosophical physicist

August 12, 2016

I could have called this post “The war between science and philosophy” or “The zero-sum game”, but I found that too childish for a subject which is important for the future of physics. There was a recent update in the “discussion” of the role of philosophy in science. Massimo Pigliucci and Sabine Hossenfelder, to take the most recent insightful articles, took position on the claim that “philosophy is not useful for doing physics”. As a baseline physicist (i.e. not one working on the fundamental questions of the universe), I have to react and say why I need philosophy. First, please excuse in advance my lack of clarity and accuracy: I do not have the experience and talent of most participants in this debate. Yet I hope to convey enough of my message to make it useful.

I would first like to cut short one objection: that I am not a theoretical physicist working on “advanced” subjects like string theory or loop quantum gravity and thus am not entitled to discuss these kinds of fundamental issues. Indeed, I am a plasma physicist; I try to understand the phenomena occurring in a plasma, how it is produced, how it reacts to some stimuli. The most “advanced” tool that I use is quantum field theory, to calculate some quantities involved in the measurement of the plasma electric field in a magnetized plasma through the Stark effect. Beyond that, I follow what happens in theoretical physics (I do not like this term because it implies a fundamental separation between experiment and theory) and I enjoy what I am able to grasp of the beauty of its constructions (as I enjoy a glimpse of category theory or of harmonic forms), but I have no practical experience there. Yet, I think that the reflection occurring at the level of theoretical physics affects the whole of physics, whatever the domain; otherwise it would be a strong, if not deadly, blow to its coherence.

To address now the core of my ideas: as a physicist, philosophy is useful to me at two levels. First, at a practical level, because I am a human and not a purely rational machine, and it is sometimes difficult to bridge the gap between the human part and the physicist part. Second, at a theoretical level, because the goal of a physicist, and more generally of a scientist, is to understand the world as a whole and, unfortunately, science fails at some point. Let’s examine these two points in more detail.

The job of a physicist is to apply the scientific method, which is characterized in daily life by two features: rationality and falsifiability. You take some assumptions, you derive a model from them and experimental predictions from the model, you run some tests and check whether they validate the model or not. If not, you check that your chain of thought is rational and, if it is, you change the assumptions. So, basically, from the assumptions to the test/theory comparison, it is algorithms in action (sorting, pattern matching, tree traversal), except that for the moment only human brains can deal with the fuzziness of reality and the absence of clear-cut borders to the area of investigation: you can always find new ramifications to other topics and you have to expand your analysis. But computers are progressing fast and taking over a big part of this work.

But what about the assumptions: where do they come from? They are derived from other assumptions. Good, you see the problem. So there is always a moment (or even several) in the day of the physicist when all scientific methods are exhausted and he scratches his head with a sigh. What is the practical solution then? He takes a step back: he tries to establish analogies with other problems, he conceives random or impossible assumptions, he drinks a coffee or goes to the theatre until the inspiration comes back. But the most effective solution is to go to the office of a colleague and discuss. And when the problem is serious (i.e. all scientific ways are exhausted), the discussion is of a philosophical nature (even if not with the quality of experienced philosophers): he tries, with his colleague, to elaborate concepts with words. Who said that words were not accurate enough to do science? They are not as accurate as equations, but their fuzzy nature is of great help when your mind is trapped by the rigidity of the equations. They give you room to expand your mind and to discuss with your colleagues. How many scientists discuss only with equations? It is not for nothing that we are asked to reduce the number of equations in a presentation: they are a bad tool for discussion, and presentations are an invitation to discussion. The philosophical discussion reduces the accuracy of the ideas but gives more flexibility and opens new areas. In this sense it is complementary to the scientific method. Through discussion (with yourself or with your colleagues) you explore new ideas and you establish new assumptions. When you come to an agreement, you apply the scientific method to them and the machine is running again.

This is also where you understand that experimental results are very useful, not only to validate or invalidate a theory, but to discuss: they are as fuzzy as words, or even fuzzier. The relation between two experimental sets of data will never be perfectly linear; you will have some scatter, which will invite discussion: is it really linear? Should we add a bit of non-linearity to the interpretation? New ideas often emerge from the discussion of experimental results.

This is why scientists should be better trained in the philosophical method: it would improve their discussions and give them the tools to elaborate concepts more easily before transforming them into scientific models. It would also probably improve the quality of human relations and remind them that they are not purely rational machines (and maybe prevent some nervous breakdowns).

The second level of interest of philosophy is more fundamental. There is a point where the scientific method does not work when you try to understand the world where you live. Actually, it breaks down for most daily issues (except if you live in a lab or your name is Sheldon): your relations with society, politics or your love affairs. You may well write a numerical model of your relationship and test it, but if the test fails, it will not be possible to change the model! Facing this situation, either you just live your life or, if you really want to understand, philosophy is the only possible rational way to approach the problem. This is all you can do when you meet the absurd, as defined by Albert Camus in The Myth of Sisyphus: the absurd arises when the human need to understand meets the unreasonableness of the world, when “my appetite for the absolute and for unity” meets “the impossibility of reducing this world to a rational and reasonable principle”. The worst moment for a scientist.

Of course, you can say that, in the end, physics will explain everything (we could discuss that; personally I am not convinced, not with the present tools) and that we are just limited for the moment by our ignorance. Sure, but now is the moment when we live, and if we want to avoid too much frustration, we have to use all possible rational tools to quench our thirst for knowledge or, at the least, to deal with the world.


About Drupal

July 19, 2016

Our plasma source project involves several teams across Europe. We wanted a centralized source of information that was remotely accessible. Our idea was to have an intranet where we could store the documentation, the to-do lists and a gallery of pictures and videos. And we needed a solution which was fast and easy to deploy. After some quick trade-offs, we chose Drupal, which is based on a classical HTML/PHP/MySQL stack.


The big advantage is that you do indeed get a polished solution very quickly, I mean within a few weeks. Everything is controlled through the integrated administrator’s GUI and the online documentation is abundant. Its use is smooth and I have had very little downtime.

So if you just want an intranet with standard features, Drupal is really the right solution. Yet, in parallel, we have developed our data processing system Gilgamesh, which is based on Jupyter and thus on Tornado in Python. As a result we found ourselves with two systems with different architectures. Of course, they have different purposes, but for some applications it would be interesting to have bridges between the two systems. For instance, in Gilgamesh you can make references to papers in LaTeX style; it would be useful to reference documents which are in the Drupal system too. In theory, it should be possible, since the document reference is saved in a MySQL database and the document itself in the filesystem. But the architectures are so different that, in practice, the interface is a nightmare to develop.
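To make the idea of such a bridge concrete, here is a minimal sketch of resolving a document reference directly from the Drupal database. The schema below is a simplified stand-in (real Drupal links files to nodes through field tables, not a direct column), and sqlite3 replaces MySQL so the sketch is self-contained; in practice you would connect with a MySQL driver such as pymysql against the live site.

```python
import sqlite3

# Stand-in for the Drupal MySQL database; in production you would use a
# MySQL connection against the real (and more convoluted) schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified mock of Drupal's node and file tables (illustrative only).
    CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT, type TEXT);
    CREATE TABLE file_managed (fid INTEGER PRIMARY KEY, nid INTEGER, uri TEXT);
    INSERT INTO node VALUES (42, 'RF source design report', 'document');
    INSERT INTO file_managed VALUES (7, 42, 'private://reports/rf_source.pdf');
""")

def resolve_document(nid):
    """Return (title, file uri) for a Drupal document node, so that a
    LaTeX-style reference in Gilgamesh could point to it."""
    return conn.execute(
        "SELECT n.title, f.uri FROM node n "
        "JOIN file_managed f ON f.nid = n.nid WHERE n.nid = ?", (nid,)
    ).fetchone()

print(resolve_document(42))
```

The SQL part is the easy half; the nightmare mentioned above is everything around it (authentication, private file paths, keeping the two systems in sync).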

Therefore, for the next project, I will avoid Drupal and build any intranet on a Tornado-based solution. In this case it will be easier to integrate it into more complex systems like Jupyter.


JupyterLab: first review

July 15, 2016

A pre-alpha version of JupyterLab has officially been released: you can see the details of the reasons and advantages on the Jupyter Blog and on the Bloomberg Blog. You will find there the slides and video of the Scipy2016 talk.

I wanted to give a first review of this new version of Jupyter. I have installed it for our Gilgamesh Data Processing System and tested it a little bit.

There are two parts: the user view and the developer view.

JupyterLab for the user

The first feeling at the start is that you get a clean desktop application in your browser: you have several movable panes and icons to start the application you need: a notebook, a console or the about panel. And you have the file manager, which is FAR better than the Jupyter dashboard: you can move files between folders, you can drag and drop. It is very practical. You have easily accessible help pages and you can place your notebooks or consoles in panes side by side.

Graphically, it is not yet finished: I find the color scheme a bit dull. But judging by the activity on GitHub, the designers are working hard on improving that.

There is one usability issue in my opinion: the menu with the commands. Why is it on the side of the file manager, outside the notebook? It is not intuitive at all.

As for the notebook itself, I am not quite sure, but I have the feeling that the display is a bit slower than in the classical notebook. This remains to be confirmed in daily use and, in any case, it does not disturb the manipulation of the cells.

Thus, we have here a useful product with clear improvements over Jupyter. There are glitches, but we have to keep in mind that this is only a pre-alpha release; it is already at a high level of quality for such an early stage. In addition, we have to understand the philosophy of JupyterLab: it is not an end product, it is an infrastructure to connect your plugins and develop your own product tailored to your needs. This is why it is important to see what is under the hood.

JupyterLab for the developer

First a note of caution: I am not a high-level front-end developer, so this review is based mainly on a comparison with the front-end of the standard version of Jupyter.

The main idea to note: JupyterLab is a front-end; not a single part of the code changes the Python server side (based on Tornado). So you can basically run Jupyter and JupyterLab on the same instance of the server (you just redirect to the right webpage to get the interface you want).

It is based on TypeScript and on PhosphorJS, which provides widgets (menus, frames, …), messaging between objects and self-aware objects à la traitlets (when their properties change, they fire signals). The result is a very clean, modular and logical structure. You build your application by assembling plugins and widgets. The communication between them is almost automatic (almost!). The communication with the Jupyter server goes through the jupyter-js-services API (which is still a bit confusing in my opinion, but this is more related to my limited abilities in JS programming).

What I have not tested yet is the use and development of ipywidgets and how their Backbone architecture is integrated into the JupyterLab architecture. But I think it can only go in a better direction.

To conclude, JupyterLab offers a set of front-end tools to easily modify or extend the Jupyter notebooks: if you don’t want a console, you can remove it or add your own, you can add notebooks with special layouts (for presentations or dashboards), or you can imagine more exotic plugins. For instance, for Gilgamesh, I am developing a plugin for a kind of “JupyterTalk”: the notebook is no longer saved as a file but in a database. Several users can connect to it, each having their own kernels and typing their own cells (each cell is identified by the username). But the display is common to all users: you see your cells and the cells from the other users. So you get a chat with a succession of messages which are more than text: real Jupyter cells (markdown or code) with their output. You can have a discussion like in a chat, but with the power of a kernel behind it to display data and run algorithms. This is something made possible by the flexibility of Jupyter. You can have Augmented Discussions.


JupyterLab is the next step on the way to developing an ecosystem instead of a simple application. This looks like a bright strategic development and I am eager to see what will come out of the imagination of the community. I think it opens many possibilities far beyond the notebook. JupyterLab is a new layer above the operating system: it is the computing system in charge of connecting the user with his kernels to support and enhance his work. Kernels can be languages but also interfaces with hardware (a Python kernel on a Raspberry Pi can give access to the GPIO ports and the associated peripherals). Therefore it will offer your narrative computing access to data, algorithms and hardware. Very promising. Good job, Jupyter developers!

Jupyter in real life – Part 3: return on experience

July 5, 2016

I presented in the previous part the design of our data processing platform. The launch of the application was progressive, with only two beta testers at the beginning; I now have eight regular users and plan for a maximum of 15 participants (remember that the platform was initially designed for a small team). So I now have a bit of experience with running a multi-user Jupyter system and have learnt the advantages and issues of this approach. This is what I want to present now.

Technical choices

I am still hesitating about two choices I made for the processing library: HDF5 (via h5py) and pandas. I am not sure whether they bring more advantages or more drawbacks.

  • For h5py (but it is basically the same for PyTables): it provides a clean API to save your raw data in a hierarchical way. Your data come from the diagnostics and you can put them in nicely prepared groups, subgroups and metadata. As far as I understand, HDF5 is supposed to deal with huge files: you are supposed to put all your experimental data in the same file; it is conceived as a replacement for the traditional directory tree of your filesystem. I didn’t do that, because my natural instinct fears big files and what happens to them if they get corrupted. And some of them have already been corrupted, so it happens. By writing one file per experiment, I lose the advantage of manipulating the metadata of all experiments in one block. Say I want to compare the maximum magnetic field from experiment to experiment: I have to open each file, read the magnetic field, close the file, open the next one and so on. With one single file, I would simply have iterated over all groups. To circumvent this problem, I have established a parallel database that gathers all metadata. It is far from being the ideal solution: when I change metadata, I need to do the writing operation twice, once in the HDF5 file and once in the database. Another issue with HDF5 is that it is ideal for frozen data structures: you get raw data and you “freeze” them in an HDF5 file. But as soon as you want to modify these data (for example, to add level-1 processed data), it becomes unclean. And finally, the API is not suited for concurrent writing: I have to impose one administrator who is the only one allowed to write to the files. For raw data this is not a problem but, as soon as you want people to add processed data to these files, it becomes just painful. I have no ideal solution to these issues. Looking around, the general solution is based on the standard filesystem. I am still not sure this is the right way either, especially for managing the metadata associated with each signal.

  • For pandas, I am also in doubt. It is really powerful for aggregating data (you need a single line to get the average, standard deviation or other attributes of a time series and display it for several experiments). But there are many cases where you have to revert to NumPy arrays, which adds long expressions to your Python code. Moreover, access to a single point in a DataFrame also requires a convoluted style.
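
To make both points concrete, here is a minimal sketch of the “parallel metadata database” mentioned above; the experiment names and fields are invented. The index is a plain JSON file written next to the HDF5 files, and pandas answers the cross-experiment question in one line instead of an open/read/close loop over every file:

```python
import json
import pandas as pd

# Illustrative metadata index: one entry per HDF5 experiment file.
# In the real setup these entries would be written alongside each file
# (by the single administrator allowed to write the HDF5 archives).
index = [
    {"experiment": "exp_001", "max_b_field": 0.12, "power_kw": 3.0},
    {"experiment": "exp_002", "max_b_field": 0.15, "power_kw": 3.5},
    {"experiment": "exp_003", "max_b_field": 0.11, "power_kw": 3.0},
]

with open("metadata_index.json", "w") as f:
    json.dump(index, f)

# One pandas expression replaces the per-file loop: which experiment
# reached the highest magnetic field?
df = pd.DataFrame(json.load(open("metadata_index.json")))
strongest = df.loc[df["max_b_field"].idxmax(), "experiment"]
```

The obvious cost, as noted above, is that every metadata change must be written twice, once in the HDF5 file and once in this index.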

There is also a more fundamental point: how to manage the API. I took the obvious solution of putting the API (all the functions specific to our experiments, like the plasma models) on the server where the IPython kernels are running, so each kernel has access to it. Main advantages: it is centralized and all changes are reflected to the users immediately; you know that all users have the same models and the same functions. But this solution also comes with drawbacks. This is research: the models evolve quickly and the underlying functions have to follow. Yet an API has to be stable, otherwise it is not usable. How do you solve these opposite constraints? I have no clear-cut answer: sometimes I have to change the functions and their parameters, and it breaks the existing notebooks; sometimes I create new functions instead, but it is not very clean. In addition, access to the content of the API, i.e. the source code, is not easy; you can use a magic command for that, but it doesn’t give you a very nice display. A more beautiful idea, which I am implementing, is to use notebooks as the support for the API. Basically, you write all your API functions in a set of notebooks (with the great advantage that you can add text, pictures or whatever is necessary to explain your code and your models) and you put these notebooks in the central repository. Now you can create a notebook and, instead of loading Python code with an import, you load the API notebooks like a module. You can even assign version numbers to the API notebooks, so that you keep compatibility when the API evolves: you just call the right version. You can also copy an API notebook, modify it to add some functionality and, when these changes are validated, share it with others on the central repository. One step further would be to use these API notebooks to provide web services.
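
A minimal sketch of this “notebook as module” idea, relying only on the fact that .ipynb files are JSON: the loader below executes the code cells into a fresh module object and ignores everything else. A real version would need more care (magics, error handling, the versioned lookup), and the file names in the usage lines are illustrative.

```python
import json
import types

def load_notebook_api(path, module_name="api"):
    """Execute the code cells of a .ipynb file into a fresh module object."""
    with open(path) as f:
        nb = json.load(f)
    module = types.ModuleType(module_name)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            # cell sources are stored as a list of lines in nbformat 4
            exec("".join(cell["source"]), module.__dict__)
    return module

# Hypothetical usage: pick the API version the notebook was written against.
# api_v2 = load_notebook_api("api_v2.ipynb")
# density = api_v2.langmuir_density(...)
```

Versioning then reduces to keeping `api_v1.ipynb`, `api_v2.ipynb`, … in the central repository and letting each user notebook load the version it needs.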


Jupyter in teamwork is great: you write a notebook, you transfer it to a teammate, and he can execute it just as it is: you have the same data and the same API; he can do exactly what you did and correct or improve your work. The principle of narrative computing is also very helpful: you can comment and explain with images, figures, whatever your team needs. This really improves communication and the debugging of problems, in the code but also in the physics models. In addition, the seaborn module brings a decisive visual gain over classical tools. There is still much room for improvement and, in my opinion, the future is really bright provided that we bring these improvements to life; I will talk about them at the end. But even when the solution you propose clearly brings big advantages, it is not enough to make it available to the users without strong advertising and strong technical support. In any case, it takes time to establish it as the reference choice for data processing. In the first days, the most used function was ‘export’, which makes it possible to transfer data to other tools like Matlab. Several actions are necessary to reverse the trend: notebook tutorials, in-depth documentation and in-person training. You first pick the early adopters, the users who are ready to test new products (and there are not so many of them), you run through some examples together, you make some comparisons with their previous code and progressively push them to stick with your solution.

Other good points are the widgets and the dashboard extension: you can add an interactive part to your notebook, which simplifies life in several situations. Many widgets are available, and you can adapt them to your needs or create new ones. Once you have working examples, it is rather straightforward to make a new one (it is more difficult to make a nice one! Frontend physicists are welcome). So you can publish an overview of your last experiment on the big screen with all the important parameters; or you can display a list of experiments and select one to get a plot of its main parameters. This is really useful. The layout possibilities lack a bit of flexibility for the moment; maybe I do not use them in the best way, or the code is still in its infancy. But it can only get better (though some will say that it will be difficult because of the old technologies used, old meaning here not angular.js). In this sense, you can have a look at JupyterLab, which could be the future version of Jupyter: the frontend is entirely rebuilt from scratch on TypeScript and PhosphorJS, which gives cleaner code and an awesome desktop-like application UI.
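
As an illustration of the “select an experiment” pattern, here is a hedged sketch with ipywidgets; the experiment names and the callback body are invented, and a real dashboard would fetch the data and redraw a figure instead of recording the selection:

```python
import ipywidgets as widgets

experiments = ["shot_1041", "shot_1042", "shot_1043"]  # illustrative IDs
selected = []  # stands in for the plotting routine

selector = widgets.Dropdown(options=experiments, description="Experiment:")

def on_change(change):
    # In a real dashboard: load the experiment and update the plots.
    selected.append(change["new"])

selector.observe(on_change, names="value")
# In a notebook cell, `display(selector)` renders the dropdown.
```

With the dashboard extension, a handful of such widgets plus a figure is already close to a small control-room application.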

But let’s go back to the present version: at some point, you will end up with plenty of notebooks in your account, some in the classical narrative fashion, others with the dashboard aspect. And here we reach a present limitation of Jupyter: the management of notebooks in the tree dashboard is awful: you can duplicate and delete, and that’s it. Normally, Jupyter notebooks are stored on the local filesystem and the user can manipulate all his data with the native file explorer. But in our case, with a database filesystem, this is not possible: Jupyter has to integrate a full-fledged file manager. JupyterLab will have one but, in the meantime, maintaining a properly shared set of notebooks is difficult.

Future steps

I am really satisfied by the result and by how Jupyter, with a central data API, really improves the research workflow. I see one direction of long-term improvement which could radically change the way we do experiments. For the moment, Jupyter is used only to process the data. The configuration and setup of the experiment are done with dedicated software (in our case Siemens WinCC) through a graphical interface which is our interface to the hardware (a Simatic). Now imagine that you can install and develop a kernel for your signal controllers and monitors. Let’s say that you have a rack of Raspberry Pis, Arduinos and Red Pitayas, with one of them used as a supervisor. You can install an IPython kernel on it with an API which defines the hardware logic (how controllers and diagnostics are interrelated, watchdogs, control loops and so on; with the Red Pitaya you can even have an FPGA part for fast processing) and offers a set of commands to access this hardware with a given configuration. This kernel can be accessed from Jupyter with a notebook, thus offering large possibilities: the most classical one would be to write ipywidgets to get back the usual GUI with knobs and displays. But we can imagine more interesting solutions: instead of writing your experimental protocol on paper and entering the corresponding program in the interface, you can write code to let the computer establish the experimental sequence itself. Let’s take a concrete example: we want to see how the plasma density evolves as a function of the operating parameters (power, magnetic field, pressure). We can define by hand the series of tests and the way each parameter will evolve. It is not straightforward because the effect of the operating parameters depends on how you make them evolve during the test.
So you have to check in previous experiments how they correlate and establish which sequences are best (ramp the power first, then the magnetic field, then the gas injection, for instance). Now, since you have the data, the controller and the computing power all available in your notebook, you can try to automate the sequence: you train a neural network on the previous data sets to highlight the patterns relevant to your objective, and then you apply this pattern to the next discharges. If you get the results you want, good; otherwise, you use these new results to improve the controller. Yes, you are in a closed loop, with the computer having access to both the inputs and the outputs: the ideal case for machine learning. And experimentalists were thinking that their job would never be threatened by machines!
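
To make the closed-loop idea concrete, here is a toy sketch in which everything is invented: the “discharge” is a noisy analytic response surface, and the proposal step is crude hill climbing standing in for the neural network:

```python
import random

def run_discharge(power, b_field, pressure):
    """Stand-in for a real shot: an invented response surface plus noise."""
    return 0.5 * power + 2.0 * b_field - (pressure - 1.0) ** 2 + random.gauss(0, 0.1)

random.seed(0)  # reproducible sketch
history = []
best = None
for shot in range(20):
    if best is None:
        # first shot: somewhere in the allowed operating window
        params = (random.uniform(0, 10), random.uniform(0, 3), random.uniform(0.5, 2.0))
    else:
        # later shots: perturb the best point found so far; a trained model
        # would replace this naive proposal step
        params = tuple(p + random.gauss(0, 0.2) for p in best[0])
    density = run_discharge(*params)
    history.append((params, density))
    if best is None or density > best[1]:
        best = (params, density)
```

The point of the sketch is the loop structure: the notebook both proposes the next operating point and sees its outcome, so each shot improves the controller.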

Jupyter in real life – Part 2: design

July 5, 2016

I explained in the first part why I chose a Jupyter-based system; in a few words: maintenance, human/data interface, Python. I will now give some details on the design of the application. A prototype can be found on my GitHub, but be careful: this is still a proof of concept, yet a working one, that my teammates and I are using (and debugging), but still at an early stage without the polished completeness of a production-grade application. Therefore, my purpose here is not to “sell” a product that can be downloaded for immediate use but to explain the method and, maybe, encourage others to develop their own application.

The application, which is officially called Gilgamesh, is made of three components:

Gilgamesh Server

It is a personal version of JupyterHub, which basically enables you to use Jupyter in the cloud: you connect to a login page with the web browser and you can start a personal instance of Jupyter with the dashboard as a front page. I say that this version is personal because I have rewritten the code almost from scratch, keeping only the main mechanism (reverse proxy/spawner) and leaving aside all that makes JupyterHub battle-hardened. The reason was twofold: I needed to use JupyterHub on Windows (the standard version cannot, because of the way process IDs are managed by Windows) and, above all, I wanted to understand how it worked. I didn’t recode all the safety systems because I didn’t need them for the proof of concept: if a process hangs, I can reboot the Hub; the number of users is limited (ten) and they won’t be disturbed too much by a few seconds of waiting. Another reason why it is personal is that I have added some services to the Hub. Actually, you can easily add services to JupyterHub by using “hooks”, which are a kind of access port for external code, but when I started the mechanism was not clear to me and it was easier to add the services directly in the Tornado code. The main service that I have added is a central repository where users can push and pull notebooks from and to their account. This is easily done because I store the notebooks not on the local filesystem but in a PostgreSQL database, using the PGContents extension from Quantopian. The other service is the bibliography: there is a BibTeX file with all the useful articles, books and other documents, which can be displayed in an HTML page (with the BibtexParser module and a Jinja2 template) and referenced in a notebook with a small JavaScript extension that I have added: it converts every \citep{xxxx2016} into a hyperlink to the corresponding document (à la LaTeX).
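
As a sketch of the bibliography service, assuming the entries have already been parsed into dicts (bibtexparser produces something similar from the shared .bib file; the keys and titles here are invented), a Jinja2 template can render the HTML list:

```python
from jinja2 import Template

# Invented entries standing in for the output of bibtexparser.
entries = [
    {"ID": "smith2016", "title": "Plasma density in helicon sources", "year": "2016"},
    {"ID": "doe2015", "title": "Langmuir probe calibration", "year": "2015"},
]

# Each <li> carries the BibTeX key as its id, so the notebook-side
# JavaScript extension can link a citation straight to its entry.
template = Template(
    "<ul>\n"
    "{% for e in entries %}"
    '  <li id="{{ e.ID }}">{{ e.title }} ({{ e.year }})</li>\n'
    "{% endfor %}"
    "</ul>"
)

html = template.render(entries=entries)
```

The `id` attributes are what make the citation-to-entry hyperlinks possible.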

Gilgamesh Library


It is the Python library that provides access to the data and to the physics models. This part depends deeply on the structure of our diagnostics, which makes it not easily exportable to other projects in its present configuration. Yet there are several patterns that can easily be generalized, and my present work is to separate this general logic from the implementation details of our diagnostics. The objective of the library is to give the user high-level access to the data, without thinking about how the data are hard-wired to the sensor, and to give him the power of data-processing libraries like pandas, scikit-learn and friends. One difficulty with the high-level access is to provide a seamless interface to data which are permanently changing from experiment to experiment: diagnostics can be changed, recalibrated, disconnected, reconnected, new components can be added to the testbed, and so on. It is painful for the user to keep track of all these changes, especially when not on location. So the idea is that the library takes care of all the details: if the user wants the current signal from the Langmuir probe, he just has to type ‘Langmuir_I’ and he will get it: the library will have found, for the requested experiment, which port it was connected to and which calibration was applied to the raw signal. This is one step towards the high-level approach and it is related to the ‘signal’ approach: you call a signal by its name and then you plot it, you check its quality, you process it. Another, complementary approach is to make the signals aware of their environment; it is the ‘machine’ approach. The testbed and its components, especially the diagnostics, are modelled in Python by classes (in a tree-like hierarchy). A given diagnostic has its own class with its name, its properties (position, surface, …), its collection of signals and its methods, which represent its internal physics model.
Let’s take again the example of a Langmuir probe: instead of calling the signal ‘Langmuir_I’ and the signal ‘Langmuir_V’ and processing them to extract the density, you just call the method Langmuir.density() and the object does all the hard work for you. So the library lets the user choose between the ‘signal’ approach for basic data processing and the ‘machine’ approach to activate the heavy physics machinery that interprets these data at a higher level.
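
The two access levels can be sketched as follows; the classes, names and the “physics” are purely illustrative (a real model would apply probe theory to the I-V characteristics):

```python
class Signal:
    """'Signal' approach: a named time series, already resolved by the
    library from the right port and calibration for the requested experiment."""
    def __init__(self, name, data):
        self.name = name
        self.data = data

class LangmuirProbe:
    """'Machine' approach: a diagnostic object that owns its signals
    and its internal physics model."""
    def __init__(self, current, voltage, area=1e-6):
        self.current = current   # Signal objects underneath
        self.voltage = voltage
        self.area = area         # probe surface, illustrative value
    def density(self):
        # Placeholder model: a toy scaling standing in for Langmuir
        # probe theory applied to the I-V data.
        return [abs(i) / self.area for i in self.current.data]

probe = LangmuirProbe(Signal("Langmuir_I", [1e-6, 2e-6]),
                      Signal("Langmuir_V", [10.0, 12.0]))
```

With this layering, `probe.current.data` is the ‘signal’ view and `probe.density()` the ‘machine’ view of the same measurement.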

Gilgamesh Manager

This is the most classical part: a standalone, GUI-based application to manage the data. I added it as a safety net: I was not sure at the beginning how easy it would be to use the notebooks to manage the data. So I used Qt Designer to develop this graphical layer on top of the Gilgamesh Library. I am not sure that I will keep this component in the future. The development of ipywidgets is fast and makes it possible to build some advanced interactive tools directly in the notebook. If you combine that with the Dashboards extension, you practically get the equivalent of a native application in the browser. OK, I exaggerate a bit, because it is not yet as fast, and the interactive manipulation of data (as with pyqtgraph, which I use in the Manager) is not as efficient, but these tools are progressing quickly and I can see a total replacement in the near future. Even now, I have a “Dashboard” notebook that displays an overview of the results of the last discharge on the big screen of the control room and it is, I must say, convincing.


That’s it: the tour of the design choices for this Jupyter-based data processing system comes to an end. Next time, I will share some feedback from developing and operating it. After that, we will have a look at some examples of each component.
