The philosophical physicist

August 12, 2016

I could have called this post “The war between science and philosophy” or “The zero-sum game”, but such titles felt too childish for a subject that matters for the future of physics. There has been a recent flare-up in the “discussion” about the role of philosophy in science. Massimo Pigliucci and Sabine Hossenfelder, to cite the most recent insightful articles, took position on the claim that “philosophy is not useful for doing physics”. As a baseline physicist (i.e. not one working on the fundamental questions of the universe), I have to react and explain why I need philosophy. First, please excuse in advance my lack of clarity and accuracy: I do not have the experience and talent of most participants in this debate. Yet I hope to convey enough of my message to make it useful.

I would first like to cut short one objection: that I am not a theoretical physicist working on “advanced” subjects like string theory or loop quantum gravity, and am therefore not entitled to discuss these kinds of fundamental issues. Indeed, I am a plasma physicist; I try to understand the phenomena occurring in a plasma, how it is produced, how it reacts to some stimuli. The most “advanced” tool that I use is Quantum Field Theory, for calculations involved in the measurement of the plasma electric field in a magnetized plasma through the Stark effect. Beyond that, I follow what happens in theoretical physics (I do not like this term because it implies a fundamental separation between experiment and theory) and I enjoy what I am able to grasp of the beauty of its constructions (as I enjoy a glimpse of category theory or of harmonic forms), but I have no practical experience there. Yet I think that the reflection occurring at the level of theoretical physics affects the whole of physics, whatever the domain; otherwise it would be a strong, if not deadly, blow to its coherence.

To address now the core of my ideas: as a physicist, philosophy is useful to me at two levels. First, at a practical level, because I am a human and not a pure rational machine, and it is sometimes difficult to bridge the gap between the human part and the physicist part. Second, at a theoretical level, because the goal of a physicist, and more generally of a scientist, is to understand the world as a whole, and unfortunately science fails at some point. Let’s examine these two points in more detail.

The job of a physicist is to apply the scientific method, which in daily life boils down to two features: rationality and falsifiability. You take some assumptions, you derive a model from them and experimental predictions from the model, you run some tests and check whether they validate the model or not. If not, you check that your chain of reasoning is rational and, if it is, you change the assumptions. So, from the assumptions to the theory/test comparison, it is basically algorithms in action (sorting, pattern matching, tree traversal), except that for the moment only human brains can deal with the fuzziness of reality and the absence of clear-cut borders to the area of investigation: you can always find new ramifications to other topics, and you have to expand your analysis. But computers are progressing fast and taking over a big part of this work.

But what about the assumptions, where do they come from? From other assumptions. Good, you see the problem. So there is always a moment (or even several) in the day of a physicist, when all scientific methods are exhausted, where he scratches his head with a sigh. What is the practical solution then? He steps back: he tries to establish analogies with other problems, he conceives random or impossible assumptions, he drinks a coffee or goes to the theatre until inspiration comes back. But the most effective solution is to go to a colleague’s office and discuss. And when the problem is serious (i.e. all scientific avenues are exhausted), the discussion is of a philosophical nature (even if not with the quality of experienced philosophers): with his colleague, he tries to elaborate concepts with words. Who said that words were not accurate enough to do science? They are not as accurate as equations, but their fuzzy nature is a great help when your mind is trapped by the rigidity of the equations. They give you room to expand your thinking and to discuss with your colleagues. How many scientists discuss only with equations? It is not for nothing that we are asked to reduce the number of equations in a presentation: they are a poor tool for discussion, and presentations are an invitation to discussion. The philosophical discussion reduces the accuracy of the ideas but gives more flexibility and opens new areas. In this sense it is complementary to the scientific method. Through discussion (with yourself or with your colleagues) you explore new ideas and you establish new assumptions. When you come to an agreement, you apply the scientific method to them and the machine is running again.

This is also where you understand that experimental results are very useful, not only to validate or invalidate a theory, but to discuss: they are as fuzzy as words, or even fuzzier. The relation between two experimental sets of data will never be perfectly linear; you will have some scatter, which invites discussion: is it really linear? Should we add a bit of non-linearity to the interpretation? New ideas often emerge from the discussion of experimental results.

This is why scientists should be better trained in the philosophical method: it would improve their discussions and give them the tools to elaborate concepts more easily before turning them into scientific models. It would also probably improve the quality of human relations and remind them that they are not purely rational machines (and maybe prevent some nervous breakdowns).

The second level of interest of philosophy is more fundamental. There is a point where the scientific method does not work when you try to understand the world in which you live. Actually, it breaks down for most daily issues (unless you live in a lab or your name is Sheldon): your relations with society, politics, or your love life. You can write a numerical model of your relationship and test it, but if the test fails, it will not be possible to simply change the model! Facing this situation, either you just live your life or, if you really want to understand, philosophy is the only possible rational way to approach the problem. This is all you can do when you meet the absurd, as defined by Albert Camus in The Myth of Sisyphus: the absurd arises when the human need to understand meets the unreasonableness of the world, when “my appetite for the absolute and for unity” meets “the impossibility of reducing this world to a rational and reasonable principle”. The worst moment for a scientist.

Of course, you can say that, in the end, physics will explain everything (we could discuss that; personally I am not convinced, not with the present tools) and that we are just limited for the moment by our ignorance. Sure, but now is the moment we live in, and if we want to avoid too much frustration, we have to use all possible rational tools to quench our thirst for knowledge or, at the least, to deal with the world.

 


About Drupal

July 19, 2016

Our plasma source project involves several teams across Europe. We wanted a centralized source of information, remotely accessible. Our idea was to have an intranet where we could store the documentation, the to-do lists, and a gallery of pictures and videos. And we needed a solution that was fast and easy to deploy. After some quick trade-offs, we chose Drupal, which is based on a classical HTML/PHP/MySQL stack.


The big advantage is that, indeed, you get a polished solution very quickly, I mean in a few weeks. Everything is controlled through the integrated administrator’s GUI and the online documentation is abundant. Its use is smooth and I had very little downtime.

So if you just want an intranet with standard features, Drupal is really the right solution. Yet, in parallel, we have developed our data processing system Gilgamesh, which is based on Jupyter and thus on Tornado in Python. As a result we found ourselves with two systems with different architectures. Of course, they have different purposes, but for some applications it would be interesting to have bridges between the two systems. For instance, in Gilgamesh you can make references to papers in LaTeX style; it would be useful to also reference documents stored in the Drupal system. In theory, it should be possible, since the document reference is saved in a MySQL database and the document itself in the filesystem. But the architectures are so different that, in practice, the interface is a nightmare to develop.
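To illustrate why the bridge is conceivable in principle, here is a minimal sketch of how Gilgamesh could resolve a document reference by querying the Drupal database directly. The host, credentials and table layout below are assumptions (the actual schema depends on the Drupal version and configuration); this is not code we run in production.

```python
import pymysql  # assumes the MySQL server used by Drupal is reachable from Gilgamesh

def resolve_drupal_document(doc_title):
    """Return the storage URI of a Drupal-managed document, or None.

    The table and column names are placeholders: the real Drupal schema
    depends on the version and on the field configuration.
    """
    conn = pymysql.connect(host="intranet", user="reader",
                           password="secret", database="drupal")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT uri FROM file_managed WHERE filename LIKE %s LIMIT 1",
                (f"%{doc_title}%",),
            )
            row = cur.fetchone()
            return row[0] if row else None
    finally:
        conn.close()
```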

Therefore, in the future, and for the next project, I will avoid Drupal and start any intranet on a Tornado-based solution. In that case it will be easier to integrate it into more complex systems like Jupyter.

 


JupyterLab: first review

July 15, 2016

A pre-alpha version of JupyterLab has officially been released: you can read about the motivations and advantages on the Jupyter blog and on the Bloomberg blog. You will also find there the slides and video of the SciPy 2016 talk.

I wanted to give a first review of this new version of Jupyter. I have installed it for our Gilgamesh Data Processing System and tested it a little.

There are two parts: the user view and the developer view.

JupyterLab for the user

The first feeling at startup is that you get a clean desktop application in your browser: you have several movable panes, and you have icons to start the application you need: a notebook, a console, or the about panel. And you have the file manager, which is FAR better than the Jupyter dashboard: you can move files between folders, you can drag and drop. It is very practical. You have easily accessible help pages and you can arrange your notebooks or consoles in panes side by side.

Graphically, it is not yet finished: I find the color scheme a bit dull. But judging by the activity on GitHub, the designers are working hard on improving that.

There is one usability issue in my opinion: the menu with the commands. Why is it on the side, next to the file manager and outside the notebook? It is not intuitive at all.

As for the notebook itself, I am not quite sure, but I have the feeling that the display is a bit slower than in the classical notebook. This remains to be confirmed in daily use, and in any case it does not disturb the manipulation of the cells.

Thus, we have here a useful product with clear improvements over Jupyter. There are glitches, but we have to keep in mind that this is only a pre-alpha release; it already shows a high level of quality for such an early stage. In addition, we have to understand the philosophy of JupyterLab: it is not an end product, it is an infrastructure to connect your plugins and develop your own product tailored to your needs. This is why it is important to see what is under the hood.

JupyterLab for the developer

First, a note of caution: I am not a high-level front-end developer, so this review is based mainly on a comparison with the front-end of the standard version of Jupyter.

The main idea to note: JupyterLab is a front-end. There is not a single part of the code that changes the Python server side (based on Tornado). So basically you can run Jupyter and JupyterLab on the same instance of the server (you are just redirected to the right webpage to get the interface you want).

It is based on TypeScript and on PhosphorJS, which provides widgets (menus, frames, …), messaging between objects, and self-aware objects a la traitlets (when their properties change, they fire signals). The result is a very clean structure, modular and logical. You build your application by assembling plugins and widgets. The communication between them is almost automatic (almost!). The communication with the Jupyter server goes through the jupyter-js-services API (which is still a bit confusing in my opinion, but that is more related to my limited abilities in JS programming).

What I have not tested yet is the use and development of ipywidgets and how their Backbone architecture is integrated into the JupyterLab architecture. But I think it can only go in a better direction.

To conclude, JupyterLab offers a set of front-end tools to easily modify or extend the Jupyter notebook: if you don’t want a console, you can remove it, or you can add your own; you can add notebooks with special layouts (for presentations or dashboards), or you can imagine more exotic plugins. For instance, for Gilgamesh, I am developing a plugin for a kind of “JupyterTalk”: the notebook is no longer saved as a file but in a database. Several users can connect to it, each having their own kernel and typing their own cells (each cell is identified by the username). But the display is common to all users: you see your cells and the cells from other users. So you get a chat made of a succession of messages, which are more than text: they are real Jupyter cells (markdown or code) with their output. You can have a discussion like in a chat, but with the power of a kernel behind it to display data and run algorithms. This is something made possible by the flexibility of Jupyter. You can have Augmented Discussions.
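To make the idea more concrete, here is a minimal sketch of the kind of storage a “JupyterTalk” plugin could rely on: each cell is stored as a row with its author, type, source and outputs, and the shared display is just the chronological list of rows. Everything here (table layout, function names) is a working assumption, not the actual plugin code.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("jupytertalk.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS cells (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        username  TEXT NOT NULL,
        cell_type TEXT NOT NULL,          -- 'code' or 'markdown'
        source    TEXT NOT NULL,
        outputs   TEXT,                   -- JSON-serialized outputs
        created   TEXT NOT NULL
    )
""")

def post_cell(username, cell_type, source, outputs=None):
    """Append one cell to the shared discussion."""
    conn.execute(
        "INSERT INTO cells (username, cell_type, source, outputs, created) "
        "VALUES (?, ?, ?, ?, ?)",
        (username, cell_type, source,
         json.dumps(outputs or []),
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

def load_discussion():
    """Return all cells in chronological order, ready to be rendered."""
    cur = conn.execute(
        "SELECT username, cell_type, source, outputs, created "
        "FROM cells ORDER BY id")
    keys = ("username", "cell_type", "source", "outputs", "created")
    return [dict(zip(keys, row)) for row in cur]
```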

Conclusion

JupyterLab is the next step on the way to developing an ecosystem instead of a simple application. This looks like a bright strategic move and I am eager to see what will come out of the imagination of the community. I think many possibilities open up far beyond the notebook. JupyterLab is a new layer above the operating system: it is the computing system in charge of connecting the user with his kernels to support and enhance his work. Kernels can be languages but also interfaces with hardware (a Python kernel on a Raspberry Pi can give access to the GPIO ports and the associated peripherals). Therefore it will offer your narrative computing access to data, algorithms and hardware. Very promising. Good job, Jupyter developers!


Jupyter in real life – Part 3: return on experience

July 5, 2016

I presented in the previous part the design of our data processing platform. The launch of the application was progressive, with only two beta testers at the beginning; I now have eight regular users and I plan for a maximum of 15 participants (remember that the platform was initially designed for a small team). So I now have a bit of experience with running a multi-user Jupyter system and have learnt about the advantages and issues that come with the method. This is what I want to present now.

Technical choices

I am still hesitating about two choices I made for the processing library: HDF5 (via h5py) and pandas. I am not sure whether they bring more advantages or more drawbacks.

  • For h5py (but it is basically the same for PyTables): it provides a clean API to save your raw data in a hierarchical way. Your data come from the diagnostics and you can put them in nicely prepared groups, subgroups and metadata. As far as I understand, HDF5 is designed to deal with huge files: you are supposed to put all your experimental data in the same file; it is conceived as a replacement for the traditional directory tree of your filesystem. I didn’t do that because my natural instinct fears big files and what happens to them if they get corrupted. And some of them have already been corrupted: [so it happens](http://cyrille.rossant.net/moving-away-hdf5/). By writing one file per experiment, I lose the advantage of manipulating the metadata of all experiments in one block. Say I want to compare the maximum magnetic field from experiment to experiment: I have to open each file, read the magnetic field, close the file, open the next one, and so on, whereas with one single file I would simply have iterated over all groups (see the first sketch after this list). To circumvent this problem, I have set up a parallel database that gathers all the metadata. It is far from the ideal solution: when I change metadata, I need to do the writing operation twice, once in the HDF5 file and once in the database. Another issue with HDF5 is that it is ideal for frozen data structures: you get raw data and you “freeze” them in an HDF5 file. But as soon as you want to modify these data (for example, to add level-1 processed data), things start to get messy. Finally, the API is not suited to concurrent writing: I have to impose one administrator who is the only one allowed to write to the files. OK, for raw data this is not a problem, but as soon as you want people to add processed data to these files, it becomes just painful. I have no ideal solution to these issues. Looking around, the common solution is based on the standard filesystem. I am still not sure this is the right way either, especially for managing the metadata associated with each signal.

  • For pandas, I am also in doubt. It is really powerful for aggregating data (you need a single line to get the average, standard deviation or other attributes of a time series and display them for several experiments). But there are many cases where you have to revert to NumPy arrays, which adds long expressions to your Python code. Moreover, access to a single point in a dataframe also requires a convoluted style (see the second sketch below).
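The first sketch below illustrates the metadata problem described in the HDF5 item above: with one file per experiment, comparing a single attribute means opening every file in a loop, whereas with a single file it is a plain iteration over groups. File names and the `B_max` attribute are invented for the example.

```python
import glob
import h5py

# One file per experiment: every comparison means open/read/close in a loop.
def max_b_field_per_experiment(pattern="data/experiment_*.h5"):
    result = {}
    for path in sorted(glob.glob(pattern)):
        with h5py.File(path, "r") as f:
            # 'B_max' is a hypothetical attribute stored at the file root
            result[path] = f.attrs.get("B_max")
    return result

# Single big file: the same question is one iteration over the groups.
def max_b_field_single_file(path="data/all_experiments.h5"):
    with h5py.File(path, "r") as f:
        return {name: grp.attrs.get("B_max") for name, grp in f.items()}
```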
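The second sketch shows the two pandas points from the list: the one-line aggregation that makes it so powerful, and the heavier syntax needed to reach one single value. The column names and fake data are invented for the example.

```python
import numpy as np
import pandas as pd

# A fake long-format table: one density time series per experiment.
df = pd.DataFrame({
    "experiment": np.repeat(["exp_001", "exp_002", "exp_003"], 100),
    "time":       np.tile(np.linspace(0, 1, 100), 3),
    "density":    np.random.rand(300),
})

# One line to get the mean and standard deviation per experiment.
summary = df.groupby("experiment")["density"].agg(["mean", "std"])

# Accessing a single point is noticeably more verbose than with a plain array.
single_value = df.loc[(df["experiment"] == "exp_002") & (df["time"] == 0.0),
                      "density"].iloc[0]
same_thing_as_array = df["density"].to_numpy()[100]
```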

There is also a more fundamental point: how to manage the API. I took the obvious route of putting the API (all the functions specific to our experiments, like the plasma models) on the server where the IPython kernels run, so each kernel has access to it. Main advantages: it is centralized and all changes are reflected to the users immediately; you know that all users have the same models and the same functions. But this solution also comes with drawbacks. This is research: the models evolve quickly and the underlying functions have to follow these changes. But an API has to be stable, otherwise it is not usable. How do you solve these opposing constraints? I have no clear-cut answer: sometimes I have to change the functions and the associated parameters and it breaks the existing notebooks; sometimes I create new functions. But it is not very clean. In addition, access to the content of the API, the source code, is not easy; you can use a magic command for that, but it doesn’t give you a very nice display. A nicer idea, which I am implementing, is to use notebooks as the support for the API. Basically, you write all your API functions in a set of notebooks (with the great advantage that you can add text, pictures or whatever is necessary to explain your code and your models) and you put these notebooks in the central repository. Now you can create a notebook and, instead of loading Python code with an import, you load the API notebooks like a module (a minimal sketch is given below). You can even assign version numbers to the API notebooks, so that you keep compatibility when your API evolves: you just have to call the right version of the API. You can also copy an API notebook, modify it to add some functionality and, when these changes are validated, share it with others on the central repository. One step further is to use these API notebooks to provide web services.
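A minimal sketch of the “notebook as API” idea: read the notebook with nbformat and execute its code cells into a namespace, with the version encoded in the file name. This shows the bare mechanism only (the full recipe for importing notebooks is documented in the IPython examples); the names and paths here are assumptions.

```python
import types
import nbformat

def load_api_notebook(name, version, folder="api"):
    """Execute the code cells of <folder>/<name>_v<version>.ipynb into a module-like object."""
    path = f"{folder}/{name}_v{version}.ipynb"
    nb = nbformat.read(path, as_version=4)
    module = types.ModuleType(f"{name}_v{version}")
    for cell in nb.cells:
        if cell.cell_type == "code":
            exec(cell.source, module.__dict__)   # run the cell in the module namespace
    return module

# Hypothetical usage: pin an analysis notebook to a given API version.
# plasma = load_api_notebook("plasma_models", 2)
# ne = plasma.langmuir_density(...)
```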

Usability

Jupyter in teamwork is great: you write a notebook, you transfer it to your teammate, and he can execute it just as it is: you have the same data, the same API; he can do exactly what you did and correct or improve your work. The principle of narrative computing is also very helpful: you can comment and explain with images, figures, whatever your team needs. This really improves communication and the debugging of problems, in code but also in physics models. In addition, the seaborn module brings a decisive visual gain over classical tools. There is plenty of room for improvement and, in my opinion, the future is really bright provided we bring these improvements to life; I will talk about them at the end. But even when the solution you propose clearly brings big advantages, it is not enough to make it available to the users without strong advertising and strong technical support. In all cases it takes time to establish it as the reference choice for data processing. In the first days, the most used function was ‘export’, which makes it possible to transfer data to other tools like Matlab. Several actions are necessary to reverse the trend: propose notebook tutorials, in-depth documentation and in-person training. You first choose the early adopters, the users who are ready to test new products (and there are not so many of them), you run together through some examples, you make some comparisons with their previous codes and progressively push them to stick with your solution.

Other good points are the widgets and the dashboards extension: you can add an interactive part to your notebook, which simplifies life in several situations. Many widgets are available; you can adapt them to your needs or create new ones. Once you have working examples, it is rather straightforward to make a new one (it is more difficult to make a nice one! Front-end physicists are welcome). So you can publish an overview of your last experiment on the big screen with all the important parameters, or you can display a list of experiments and select the one for which you want the plot of the main parameters (see the sketch below). This is really useful. The layout possibilities are for the moment lacking a bit of flexibility; maybe I do not use them in the best way, or the code is still in its infancy. But it can only get better (although [some will say](https://www.linkedin.com/pulse/comprehensive-comparison-jupyter-vs-zeppelin-hoc-q-phan-mba-) that it will be difficult because of the old technologies used; old meaning here not [angular.js](https://angularjs.org/)). In this respect, you can have a look at JupyterLab, which could be the future version of Jupyter: the front-end is entirely rebuilt from scratch on TypeScript and PhosphorJS, which gives cleaner code and an awesome desktop-like application UI.
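As an example of the kind of small interactive tool mentioned above, here is a sketch of an experiment selector: a dropdown widget and a callback that plots the main parameter of the chosen experiment. The `load_experiment` helper is hypothetical and returns fake data; in Gilgamesh it would call the data API.

```python
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np

def load_experiment(exp_id):
    """Placeholder for the real data access layer: returns time and density."""
    t = np.linspace(0, 1, 500)
    return t, np.abs(np.sin(10 * t * (1 + hash(exp_id) % 5)))

def show_experiment(exp_id):
    t, density = load_experiment(exp_id)
    plt.figure(figsize=(7, 3))
    plt.plot(t, density)
    plt.xlabel("time [s]")
    plt.ylabel("density [a.u.]")
    plt.title(exp_id)
    plt.show()

widgets.interact(show_experiment,
                 exp_id=widgets.Dropdown(options=["exp_001", "exp_002", "exp_003"],
                                         description="Experiment"))
```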

But let’s go back to the present version: at some point, your account will hold plenty of notebooks, some in classical narrative fashion, others with the dashboard aspect. And here we reach a present limitation of Jupyter: the management of notebooks in the tree dashboard is awful: you can duplicate and delete, and that’s it. Normally, Jupyter notebooks are stored on the local filesystem and the user can manage all his files with the native file explorer. But in our case, with a database-backed filesystem, that is not possible: Jupyter has to integrate a full-fledged file manager. JupyterLab will have one, but in the meantime, maintaining a proper shared set of notebooks is difficult.

Future step

I am really satisfied with the result and with how Jupyter, with a central data API, really improves the research workflow. I see one direction of long-term improvement which could radically change the way we do experiments. For the moment, Jupyter is used only to process the data. The configuration and setup of the experiment are done with dedicated software (in our case Siemens WinCC) through a graphical interface which is our interface to the hardware (a Simatic). Now imagine that you can install and develop a kernel for your signal controllers and monitors. Let’s say that you have a rack of Raspberry Pis, Arduinos and RedPitayas, with one of them acting as a supervisor. You can install an IPython kernel on it with an API which defines the hardware logic (how controllers and diagnostics are interrelated, watchdogs, control loops and so on; with the RedPitaya you can even have an FPGA part for fast processing) and which offers a set of commands to access this hardware with a given configuration. This kernel can be accessed from Jupyter with a notebook, which opens up large possibilities: the most classical one would be to write ipywidgets to get back the usual GUI with knobs and displays.

But we can imagine more interesting solutions: instead of writing your experimental protocol on paper and entering the corresponding program in the interface, you can write code to let the computer establish the experimental sequence itself. Let’s take a concrete example: we want to see how the plasma density evolves as a function of the operating parameters (power, magnetic field, pressure). We can define by hand the series of tests and the way each parameter will evolve. It is not straightforward, because the effect of the operating parameters depends on how you make them evolve during the test. So you have to check in the previous experiments how they correlate and establish which sequences are the best (a ramp in power first, then a ramp in magnetic field, then gas injection, for instance). Now, since you have the data, the controller and the computing power all available in your notebook, you can try to automate the sequence: you train your neural network on the previous sets of data to highlight the patterns relevant to your objective, and then you apply this pattern to the next discharges. If you get the results you want, good; otherwise, you use these new results to improve the controller. Yes, you are in a closed loop, with the computer having access to both the inputs and the outputs: the ideal case for machine learning (a rough sketch is given below). And experimentalists were thinking that their job would never be threatened by machines!
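The closed-loop idea can be summarized in a few lines of Python. Everything below is hypothetical: `testbed` stands for a kernel-side hardware API (the Raspberry Pi/RedPitaya supervisor) and `model.suggest_next_settings` for whatever model, neural network or otherwise, is trained on the previous discharges.

```python
# Hypothetical closed-loop experiment driver: none of these objects exist yet,
# they only illustrate the direction described in the text.

def run_campaign(testbed, model, targets, n_discharges=20):
    history = []
    for shot in range(n_discharges):
        # The model proposes the next operating point from what it has seen so far.
        settings = model.suggest_next_settings(history, targets)   # power, B-field, pressure

        testbed.set_power(settings["power_kW"])
        testbed.set_magnetic_field(settings["b_field_T"])
        testbed.set_gas_pressure(settings["pressure_Pa"])

        data = testbed.run_discharge()           # acquire the raw signals
        density = data.langmuir.density().max()  # level-1 processing via the data API

        history.append({"settings": settings, "max_density": density})
        model.update(history)                    # close the loop: learn from the result
    return history
```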


Jupyter in real life – Part 2: design

July 5, 2016

I explained in the first part why I chose a Jupyter-based system; in a few words: maintenance, human/data interface, Python. I will now give some details on the design of the application. A prototype can be found on my GitHub, but be careful: this is still a proof of concept, a working one that my teammates and I are using (and debugging), but still at an early stage, without the polish and completeness of a production-grade application. My purpose here is therefore not to “sell” a product that can be downloaded for immediate use, but to explain the method and, maybe, encourage others to develop their own application.

The application, which is officially called Gilgamesh, is made of three components:

Gilgamesh Server

It is a personal version of JupyterHub, which basically lets you use Jupyter in the cloud: you connect to a login page with the web browser and you can start a personal instance of Jupyter with the dashboard as a front page. I say that this version is personal because I have rewritten the code almost from scratch, keeping only the main mechanism (reverse proxy/spawner) and leaving aside everything that makes JupyterHub battle-hardened. The reason was twofold: I needed to use JupyterHub on Windows (the standard version cannot, because of the way process IDs are managed by Windows) and, above all, I wanted to understand how it worked. I didn’t recode all the safety systems because I didn’t need them for the proof of concept: if one process idles, I can reboot the hub; the number of users is limited (ten) and they won’t be disturbed too much by a few seconds of waiting. Another reason why it is personal is that I have added some services to the hub. Actually, you can easily add services to JupyterHub by using “hooks”, which are a kind of access port for external code. But when I started, the mechanism was not clear to me and it was easier to add the services directly in the Tornado code. The main service I added is a central repository from which users can push and pull notebooks to and from their account. This is easily done because I store the notebooks not on the local filesystem but in a PostgreSQL database, using the PGContents extension from Quantopian. The other service is the bibliography: there is a BibTeX file with all the useful articles, books and other documents, which can be displayed in an HTML page (with the bibtexparser module and a Jinja2 template) and which can be referenced in a notebook thanks to a small JavaScript extension I added, which converts every \citep[xxxx2016] into a hyperlink to the corresponding document (a la LaTeX).
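The bibliography service is simple enough to sketch: parse the BibTeX file with bibtexparser and render an HTML page with a Jinja2 template. The field names follow standard BibTeX entries; the template is of course a stripped-down stand-in for the real one.

```python
import bibtexparser
from jinja2 import Template

with open("bibliography.bib") as f:
    bib_db = bibtexparser.load(f)   # returns a BibDatabase with a list of entry dicts

template = Template("""
<html><body>
  <ul>
  {% for e in entries %}
    <li id="{{ e.ID }}">
      {{ e.get('author', '?') }} ({{ e.get('year', 'n.d.') }}),
      <em>{{ e.get('title', '') }}</em>
    </li>
  {% endfor %}
  </ul>
</body></html>
""")

html = template.render(entries=bib_db.entries)
with open("bibliography.html", "w") as f:
    f.write(html)
```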

[Screenshot: Jupyter Dashboard Extension]

Gilgamesh

This is the Python library that provides access to the data and to the physics models. This part is deeply dependent on the structure of our diagnostics, which makes it not easily exportable to other projects in its present configuration. Yet there are several patterns that can easily be generalized, and my current work is to separate this general logic from the implementation details of our diagnostics. The objective of the library is to give the user high-level access to the data, without having to think about how the data are hard-wired to the sensor, and to give him the power of data processing libraries like pandas, scikit-learn and friends. One difficulty with the high-level access is to provide a seamless interface to data that change from experiment to experiment: diagnostics can be changed, recalibrated, disconnected, reconnected, new components can be added to the testbed, and so on. It is painful for the user to keep track of all the changes, especially if you are not on location. So the idea is that the library takes care of all the details: if the user wants the current signal from the Langmuir probe, he just has to type ‘Langmuir_I’ and he will get it: the library will have found, for the requested experiment, on which port it was connected and which calibration was applied to the raw signal. This is one step towards the high-level approach and it corresponds to the ‘Signal’ approach: you call a signal by its name and then you plot it, you check its quality, you process it. Another, complementary approach is to make the signals aware of their environment; this is the ‘Machine’ approach. The testbed and its components, especially the diagnostics, are modelled in Python by classes (in a tree-like hierarchy). A given diagnostic has its own class with its name, its properties (position, surface, …), its collection of signals and its methods, which represent its internal physics model. Let’s take the example of the Langmuir probe again: instead of calling the signals ‘Langmuir_I’ and ‘Langmuir_V’ and processing them to extract the density, you just call the method Langmuir.density() and the object does all the hard work for you. So the library lets the user choose between the ‘signal’ approach for basic data processing and the ‘machine’ approach to activate the heavy physics machinery and interpret these data at a higher level. A minimal sketch of the two approaches is given below.
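This sketch uses invented names and a placeholder lookup: `get_signal` illustrates the ‘signal’ approach (the library resolves ports and calibrations behind the scenes), and the `LangmuirProbe` class illustrates the ‘machine’ approach, where the physics lives in the diagnostic object. The density formula is a toy placeholder, not the real model.

```python
import numpy as np

def _lookup(name, experiment):
    """Stub standing in for the real port/calibration resolution."""
    raw = np.random.rand(1000)              # fake raw signal
    calibration = lambda x: 2.0 * x - 1.0   # fake calibration
    return raw, calibration

# --- 'Signal' approach: ask for a name, get a calibrated signal back --------
def get_signal(name, experiment):
    """Hypothetical entry point: finds on which port the signal was acquired
    for this experiment and applies the stored calibration."""
    raw, calibration = _lookup(name, experiment)
    return calibration(raw)

# --- 'Machine' approach: the diagnostic object carries its physics ----------
class LangmuirProbe:
    def __init__(self, experiment, surface_m2):
        self.experiment = experiment
        self.surface = surface_m2

    def current(self):
        return get_signal("Langmuir_I", self.experiment)

    def voltage(self):
        return get_signal("Langmuir_V", self.experiment)

    def density(self):
        """Toy estimate only: the real model is part of the physics API."""
        i_sat = np.abs(self.current()).max()
        return i_sat / (1.6e-19 * self.surface * 1e4)   # placeholder formula
```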

Gilgamesh Manager

This is the most classical part: a standalone, GUI-based application to manage the data. I added it as a safety net: I was not sure at the beginning how easy it would be to use the notebooks to manage the data. So I used Qt Designer to develop this graphical layer on top of the Gilgamesh library. I am not sure that I will keep this component in the future. The development of ipywidgets is fast and makes it possible to build advanced interactive tools directly in the notebook. If you combine that with the Dashboards extension, you practically get the equivalent of a native application in the browser. OK, I exaggerate a bit, because it is not yet as fast, and the interactive manipulation of data (like with pyqtgraph, which I use in the Manager) is not as efficient; but these tools are progressing quickly and I can see a total replacement in the near future. Even now, I have a “Dashboard” notebook that displays the overview of the results of the last discharge on the big screen of the control room and it is, I must say, convincing.

[Screenshot: Jupyter Dashboard Extension]

That’s it: the tour of the design choices for this Jupyter-based data processing system comes to an end. Next time, I will give some return on experience from its development and operation. After that, we will have a look at some examples of each component.


Jupyter in real life – Part 1: specs

July 5, 2016

Jupyter is the reference in terms of notebooks. Its principle of narrative computing offers many advantages, but the most common application is education (see for instance this list of notebooks, which are mainly tutorials). The ability to follow a calculation step by step, and to redo it ourselves, is of course already a big help in understanding a subject. Yet I am convinced that the notebook, and the evolution it is presently undergoing, can also play an active role in research and production. I want to share in a series of posts one particular application of notebooks, with the concrete example of our testbed, in the hope that it can convince other people to use them, or prompt them to share their own experience in research.

In this first post, I will explain why I have chosen Jupyter over more classical methods for data sharing and processing.

The need for a data platform

I run a middle-sized experiment, worth several hundred thousand euros, which aims at producing a helicon plasma and analyzing its interactions with radio-frequency waves.

[Image: the Ishtar testbed]

Despite its limited size, the experiment involves several teams distributed over several European countries and plans to extend the cooperation to other continents. The idea is to have a shared experimental platform accessible to whoever wants to carry out measurements on this kind of plasma source, with a friendly plug-and-play interface for diagnostics and easy access to the data. In brief, this should be a 21st-century way to do “cloud experimenting” on a modest budget. In a less emphatic and more concrete tone, my need was the following: all data transit through LabVIEW (I will explain, but not defend, this choice on another occasion; in short: time constraints). They come raw; I want to apply all the calibrations and metadata stamping, and make them accessible on another, more flexible and cost-effective system. In addition, I would like the users to have access to the configuration of the testbed, so that they know what kind of hardware was present when the data were acquired.

Distribute data, but in a meaningful way

My main concern was to make the data available to the distributed team. My first idea was rather classical: develop a data server. Basically, the data are stored on a computer with, for instance, an HTTP server, and each user connects either through a web browser or through a dedicated client to display the list of experiments and the associated data and to download them. Since I wanted to use Python anyway (because it is, in my opinion, the language best suited to this kind of Swiss-army-knife manipulation of data and metadata), I was thinking of implementing a Tornado server like the HDF server. It could have been an extension of our present intranet, but the implementation would have been difficult since this intranet runs on Apache/PHP/Drupal (a fast and efficient solution, but not the most appropriate in the long term; that is another story), or it could have been standalone. Another option could have been something like Tango, which is used on big experiments like Sardana, but since we already had our own control system, it would have been overkill. So the server version seemed the most suited to our requirements.
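For reference, the “classical” server option would have looked something like the minimal Tornado handler below: the client asks for an experiment and a signal, and the server returns it as JSON. Paths and names are invented, and the data are faked to keep the sketch self-contained; this is the approach I eventually set aside.

```python
import json
import numpy as np
import tornado.ioloop
import tornado.web

class SignalHandler(tornado.web.RequestHandler):
    def get(self, experiment, signal):
        # A real server would read the HDF5 file of the experiment here;
        # we just return fake data to keep the sketch self-contained.
        data = np.random.rand(100).tolist()
        self.set_header("Content-Type", "application/json")
        self.write(json.dumps({"experiment": experiment,
                               "signal": signal,
                               "data": data}))

def make_app():
    return tornado.web.Application([
        (r"/data/([^/]+)/([^/]+)", SignalHandler),
    ])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()
```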

Not the obvious choice

Yet I was not convinced by this choice for several reasons:

  • Each experiment contains several GB of data and we can have up to 100 experiments each day. Not all the data are relevant, but we are still at a stage where we don’t know exactly how to clean them. This means that people would tend to download a huge amount of data just to process a small part of it. I did not have enough bandwidth to support useless data transfers; I wanted a more economical way to deal with the data.
  • If a dedicated client is used, it is faster than a web browser, but we would have to update it with each evolution of the database and make sure that each client computer has the right versions of Python and the various modules. In a collaboration where people come and go often, it could become very time-consuming to check that every user is equipped with the proper tools; so I wanted a solution where the maintenance is centralized.
  • The fact that the team is physically distributed means that everybody will work on the data in their own way, with their own tools and their own models. So, in addition to sharing the data, we would have to develop and install tools to share the numerical tools and the physics models, and to improve communication. This is what is done in most collaborations, but it is probably not optimal and there is room for enhancement; I wanted to try new solutions here.
  • Finally, I am convinced that notebooks are the future of data processing and computing. They bring a huge improvement to the human/computer interface, with a nice, easy way to explain what you are doing in your calculations, or how to use the data. It is particularly useful for a collaboration with temporary members (students, short-term participants): they can follow your steps and understand how to process the data with a very smooth learning curve. In addition, you shorten the path between the retrieval of the data, their processing and analysis, and the publication. All in all, notebooks maximize the time dedicated to the creative part. I wanted to use this killer feature and see it working in real conditions.

This is why I decided to give the Jupyter-based solution a try. It opens many interesting perspectives, even though some hurdles still need to be overcome. This will be the subject of the next post, where I will detail the design choices of this solution, with more emphasis on the code.


The Pelican Experiment: the end

July 5, 2016

Sometimes I need to be pragmatic even if it means that I have to give up a project where I have invested a lot.

I have tried to move the blog to Pelican. I explained my main reasons here. After several weeks, I must admit that it was not a good idea. Not that Pelican is a bad product; on the contrary, it is nerdy fun to play with, but it is not suited to my needs and configuration. It is better for people with a single development computer who write regularly on their blog and make a showcase of it. My blog is more a kind of public area where I put my ideas, links and projects, a way to express my thoughts. I don’t want to compile, commit and push each time I need to put some words online. In addition, I was bothered by the absence of interactivity.

So, basically, I am coming back to WordPress. I will move back the few articles I had written there.

