Feb. 2021 status: Dark mode and organisation plans

This is a new format of post we are starting on the blog: publishing the monthly updates that were until now only circulated in the internal XWiki newsletter. This will be an opportunity to regularly catch up on new features, research projects, funding/budget updates, and any other relevant news.

FOSDEM presentations

Aaron and David presented different aspects of CryptPad at FOSDEM 2021. Please see the updated blog post for videos of both presentations.

Dark mode

This month we followed up on the rebranding started with version 4.0 by thoroughly refactoring how styles, especially colors, are applied across CryptPad. This allows for better maintenance and easier customisation. The first custom theme is the long-requested dark mode.

CryptPad will now follow the browser or operating system preference by default, and switch to a dark theme accordingly. The theme can also be set manually in Settings > Appearance.

The CryptDrive in dark mode

Following the introduction of the dark theme in our 4.2 release, we noticed a few problems and got to work on correcting them. The most noticeable issue was the use of a dark background for rich text documents. Wanting to offer a “true” dark mode, we had initially switched the editor itself to a dark background, and made the default text color contrast with that automatically. It soon became apparent that this was a problematic choice in rich text documents, where users are able to set colors for text: depending on the theme used, text could become unreadable. One particularly painful example was a document about making web content accessible, written in black text on a dark background. We reverted our decision and opted for a light background in the editor even when the rest of the interface is dark. There is a reason mainstream editors such as Microsoft Word do it this way. You can expect more polish on the dark theme in the upcoming 4.3 release.

Web accessibility guide shown with black text on a dark background

The web content accessibility guide that prompted us to revert our decision on the dark theme rich text editor. The guide is by AccessiBloc.
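To make the readability problem concrete, here is a minimal sketch of the contrast ratio calculation defined in the WCAG 2 guidelines. The dark background value (#212121) is an assumption chosen for illustration, not the color CryptPad actually used.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color, as used by WCAG 2."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


BLACK = (0, 0, 0)
WHITE = (255, 255, 255)
ASSUMED_DARK_BG = (33, 33, 33)  # hypothetical dark editor background (#212121)

print("black on white:", round(contrast_ratio(BLACK, WHITE), 2))
print("black on dark: ", round(contrast_ratio(BLACK, ASSUMED_DARK_BG), 2))
```

WCAG level AA asks for a ratio of at least 4.5 for normal-size text. Black text that a user picked for a light page falls far below that on any dark background, which is why a fixed light editor background is the safer default.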

Organization plans

On cryptpad.fr, another long-awaited feature was the Organization plans. We have been communicating this pricing on request for the last few months, but the plans are now live in the cryptpad.fr interface. These bigger plans have the additional option to download a personalised signed Data Processing Agreement (DPA) for organisations that need to demonstrate they operate according to the GDPR.

These plans come with 1 business day support and increased storage shared between a number of user accounts, priced as follows:

  • 25 Users with 100GB of storage for 500€ a year (ex. VAT)
  • 100 Users with 150GB of storage for 1000€ a year (ex. VAT)
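For comparison, the two plans can be broken down into yearly cost per user and storage per user. This is a small sketch using only the figures listed above; the field names are made up for the example, and prices exclude VAT.

```python
# Figures taken from the two plans listed above (prices per year, ex. VAT).
PLANS = [
    {"users": 25, "storage_gb": 100, "price_eur": 500},
    {"users": 100, "storage_gb": 150, "price_eur": 1000},
]


def unit_costs(plan: dict) -> dict:
    """Break a plan down into per-user figures."""
    return {
        "eur_per_user": plan["price_eur"] / plan["users"],
        "gb_per_user": plan["storage_gb"] / plan["users"],
    }


for plan in PLANS:
    costs = unit_costs(plan)
    print(
        f'{plan["users"]} users: {costs["eur_per_user"]:.2f} EUR per user per year, '
        f'{costs["gb_per_user"]:.1f} GB per user'
    )
```

The larger plan halves the yearly price per user, while the shared storage works out to less space per account.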

An additional On Premises option is available for organizations that require their own CryptPad instance, with installation and maintenance support by the development team.

We were happy to welcome the first couple of subscribers on these plans and hope that they will contribute to making CryptPad financially sustainable in the longer term.

The new organization plans on cryptpad.fr in dark mode.

We had an unexpected spike in traffic early in the month after the following tweet linked to a toolkit made on cryptpad.fr.

This brought a lot of traffic to the service, as illustrated by the spike below. While this was a surprise, our infrastructure was prepared for it and held up very well.

graph showing a big spike in visits to CryptPad.fr

Delivered: NGI Trust project: Secure Mobile Collaboration

We have wrapped up this exploratory project about using CryptPad on mobile devices. There will be dedicated posts about this project in the near future. This project allowed us to scope out, in depth, the options available to make CryptPad work as an “app”. As a summary of our findings, here is what we plan to include in the new Frequently Asked Questions section of our documentation that will be part of the next release:


FAQ: Are you planning a mobile app?

We are not planning a dedicated mobile application for the following reasons:

  • It would dramatically increase the amount of code that has to be developed and maintained, effectively creating other “versions” of CryptPad for iOS and Android.
  • CryptPad is open source and can be hosted by anyone who wants to offer the service. Therefore, users of a mobile application would have to specify which CryptPad instance they want to connect to, which would be confusing. To complicate things further, each instance may be running a different version of the software, depending on whether or not the latest updates were applied by the administrators.

To address these problems, the development team is working on making CryptPad a “Progressive Web App”. This means that it can be used on mobile through the web browser, behaving like an application while being the same software that runs on desktop browsers. This has the benefit of turning every CryptPad instance into a web app provider, rather than putting the burden of choosing the right instance on the user.


This approach has already started to inform new developments for CryptPad, for example the use of IndexedDB for caching documents which is already deployed. Further improvements will follow, including a full “offline” mode.

This wraps up our first monthly status post. In March we will be shifting back to our NLNet Communities project and attempt to finish the outstanding deliverables around documentation for developers and instance administrators.

CryptPad at FOSDEM 2021

(this post was edited on 24th Feb. 2021 to include links to videos and corrections)

The CryptPad team is taking part in the 2021 online edition of FOSDEM. We will use this opportunity to reflect on the past year from a couple of different perspectives.

Aaron MacSween’s presentation is about the technical challenges faced by the team this year. The massive influx of users working from home pushed us to scale CryptPad to accommodate an additional 60K weekly active users. This was made easier by the platform’s unique architecture, where most of the “expensive” work involving cryptography happens on the client rather than the server. Additional challenges involved a 27 hour outage due to a cooling malfunction at our hosting provider. While the outage itself was out of our control, it brought into sharp relief that our procedures to mitigate uncertainty had not scaled with our user-base. Aaron will speak about what we plan to do to avoid such situations in future.

In the design devroom, I will reflect on my first year as the designer on the CryptPad team. My work has been spread across many different areas, from UI design to answering support tickets, writing the product and documentation, as well as visual identity. All of these elements boil down to one thing: communication. I will show some examples of work produced this year as attempts to improve how CryptPad communicates, from onboarding to daily-use. I will conclude with one of the challenges for the year ahead: accessibility. Communication is all well and good, but of no use if it cannot be heard on a screen reader.

Talks are pre-recorded and will be aired on Saturday 6th February. For more information, abstracts, and broadcast time with Q&A session, see the indications below.

This blog post will be updated with video embeds once these are available.

Living on the edge with CryptPad

  • Speaker: Aaron MacSween

Due to unforeseen circumstances, Aaron was unable to include his presentation in the FOSDEM track. However he still recorded it so we are making it available here and on our PeerTube channel.

Watch on the CryptPad PeerTube channel

Communicating CryptPad

Watch on the CryptPad PeerTube channel

No plan survives first contact with the enemy

In 2019 we finished a four-year research project that had covered the majority of CryptPad’s development costs. We had some worries about how we would continue to fund our team, but we were fortunate enough to meet and form a good relationship with members of Europe’s Next Generation Internet Initiative.

We received 50000 Euros from NLnet as a part of their NGI0 Privacy Enhancing Technologies grant program. Though we’d planned to finish this project (CryptPad Teams) before the end of 2019, research projects at this scale require a faster pace than we were used to. We’d had an intern join our team over the summer, our plan didn’t really account for vacation days, another salaried worker joined our team in November, and in general there were just many distractions that made everything take a bit longer than expected.

We made up most of the difference with an increasing number of donations and subscriptions via our premium accounts portal, and we had written a number of new grant proposals for the coming year. Our second NLnet proposal (CryptPad for Communities) had already been accepted, but we were waiting to sign the final contract before making any announcements. So, with 2020 on the horizon, I wrote an article which alluded to our plans while we waited to hear back about which of our remaining proposals would be accepted.

2020’s projects

In early 2020 we were still finishing up the final components of CryptPad Teams. In addition to the remaining technical features we were also required to complete two audits of the platform: one to assess CryptPad’s accessibility and another quick scan of its security features. We didn’t really know how long these would take, and we hadn’t budgeted additional time for them, so these delayed our other projects and added a little bit to our 2019 deficit.

We already knew to expect another 50000 Euros from NLnet for our Communities project, but since the status of our other proposals was still uncertain we decided to attend the Open-Source Speed Dating session at FOSDEM. Two of our team members pitched a project to speed up CryptPad’s page loading times, making for a total of three pending proposals.

As it turned out, we heard about all three projects in the space of a few days and all three were accepted. We weren’t expecting all of these proposals to be successful, so we had to adjust a lot of our plans to ensure we could manage all of their respective deadlines, but on paper it all seemed manageable.

CryptPad for communities

We’d already begun working on Communities’ features quite early in the year. The project included a number of high-level themes, but the overall goal was to make it easier for groups of various sizes to adopt or transition to CryptPad instead of proprietary alternatives.

Firstly, we’d heard from small businesses and social initiatives that they wanted to use CryptPad but needed some new features before they could make the switch. We made major changes to our Kanban, rich text, and spreadsheet editors.

CryptPad’s admin panel, which used to be very limited, now features a variety of controls for adding or modifying quotas for particular users, along with a variety of other configuration options to make it easier to run your own CryptPad instance. We still need to add the ability to restrict registration and unregistered usage, but we expect to deliver this in early 2021.

Finally, we launched our documentation platform, which is available in English, French, and (courtesy of some dedicated contributors) German. There is currently only a user guide, but we’ll soon offer a thorough installation guide for admins and some technical documentation for contributors.

Secure Mobile Collaboration

The goal of this project was to experiment with different technologies and ultimately prototype some dedicated mobile and desktop apps for CryptPad. Our intent was to make CryptPad usable on mobile devices while also improving security by distributing static builds of our source code with cryptographic signatures so their authenticity could be verified.

We pitched this project to NGI TRUST at the end of November 2019 and framed it as an experiment since we weren’t sure we’d be able to maintain dedicated apps in addition to the web platform we already offer. Nevertheless, we know that mobile support is important to our users and we wanted to dedicate time to investigate our options.

We expect to finish this project soon but our approach has diverged from its early goals in some very notable ways. For now I’ll just say that a lot of time and effort has gone towards addressing the intended problems and that you can expect a dedicated blog post or two about this in the near future.

Dialogue

Not long after proposing Communities to NLnet we pitched this third PET project. It can take several months for these proposals to pass through their various stages of review, and each project only funds our team for part of the year, so it’s important that we line up our next project before the current ones finish. At the same time, we can’t (legally) get paid by multiple funding bodies for the same work, so we need to ensure that projects don’t overlap.

We applied for this and the NGI TRUST grant concurrently, but we didn’t expect to win both. NLnet’s deadlines are considerably less strict, however, so we’ve prioritized SMC and saved Dialogue for the coming year. All NGI0 PET projects have to be completed by late 2021, so we expect this to be our last.

CryptPad is currently specialized mostly for real-time document editing, and our cryptographic permissions system reflects that. The main idea behind this project is to develop a new set of applications with different permission schemes that support more granular permissions for document components instead of all-or-nothing permissions for whole documents.

We already offer a poll application, but it uses the same editor/viewer roles as our document editors, which really doesn’t match users’ expectations. This current implementation will be phased out in favour of the new scheme to support distinct roles for authors (who can ask questions and determine who can answer them), responders (who can submit answers), and viewers (who can see responses). We’re also going to add support for more complex surveys with multiple questions, implement a reminder system to notify authors and viewers when their polls have closed, and add some more instance admin functionality so that we and other people hosting CryptPad can communicate with their users via the existing notification system.

MOSS

The requirements of Mozilla’s Open-Source Support program were considerably less formal than those of NLnet and NGI TRUST. We received 10000 USD, which converted to about 9000 Euros at the time we received it, and we promised to use it to improve page loading times. There wasn’t any contract or formal definition of how we’d planned to do this, and no deadline given.

This funding model was extremely helpful for us this year and did a fantastic job of living up to its name and goal of supporting open-source. Our European funding partners provide all or most of their financial support as their deliverables or the entire project are completed. By contrast, MOSS solved some immediate cash-flow issues during this difficult year and afforded us the flexibility to fulfill our promises in between our other deadlines.

So far we’ve followed up on these goals by profiling page loading times on different devices to determine where to best spend our efforts. We’ve made a number of small optimizations on the client, along with some big improvements to parts of the server that were frequently the cause of bottlenecks when establishing a new connection. There’s still much more to do in this regard, and we plan to post ongoing updates as we find more room for improvement.

A year of surprises

With the exception of our MOSS grant, everything I’ve mentioned so far was planned and proposed late in 2019. We’d set our objectives for 2020 early on and had carefully considered how we could coordinate our multiple projects and how their features could complement each other. As you might imagine, very little went according to plan.

I vaguely recall a few headlines about a respiratory illness being discovered in China late last year, but I didn’t give it much thought and obviously didn’t foresee the impact it would have on our plans for the year, let alone everything else it affected. As the epidemic became more widespread, was upgraded to pandemic status, and triggered lockdowns across the world, more and more people moved to working online. Previously, I was happy with our success when we saw ten to fifteen thousand users in a week, but those numbers quickly doubled, tripled, and quadrupled in a matter of months as offices and classrooms started relying heavily on our platform.

Unique IPs visiting CryptPad.fr per day

We made some significant changes to our server code to keep up with demand and eliminated some of our client’s code that was particularly expensive for the server. The precise technical details of exactly what we did to adapt to the dramatic increase in usage deserve their own article, but in general we suddenly had to pay a lot more attention to our infrastructure than was previously the case. We started regularly allocating more disk space to the server and, as 2020 ends, we now store more than six times more user data than we did this time last year.

One major lesson we’ve learned, however, is that it’s been far easier to scale our infrastructure than manual support for the platform. Our surge of new users came along with a matching increase in support tickets, emails, GitHub issues, and questions on social media. We prioritized the documentation that we were writing as a part of our Communities project, but we still had to take time to answer the questions of people who hadn’t found those docs or whose questions were not clearly answered therein.

We’re still working to streamline this process, but our ability to respond to individual questions is a frequent bottleneck for our team. This typically makes it more difficult to stay on top of our usual development cycle, and leaves less time than we’d like for promoting the project via public events or blog articles. Having too many users is a fantastic problem to have, though, so this is less a complaint and more an acknowledgement of a challenge that we need to address. We can’t afford to be just a team of software developers that also happen to maintain and support a platform when both activities are equally important to our continued success.

Conclusion

After the year we’ve had it’s tempting to view the future as increasingly uncertain, but the reality is that nothing was ever certain to begin with. We’re still making plans for 2021, but our plans now include more caveats and fallbacks to (hopefully) lessen the impact of whatever else we don’t see coming.

With all the unexpected stress of this year it’s difficult to remember the good things, but we’ve had an incredible increase in support from our users. Contributors have helped to add some significant features to the platform this year and have translated CryptPad into a number of languages. In the past two months subscriptions and donations have covered one of our three team members’ salaries. Our yearly revenue has once again more than doubled compared to the previous twelve months, and if this trend continues we’ll be able to fund our current team’s salaries without having to depend on grants.

There’s a lot more to be said about our goals for the future, but we still have a number of projects to complete, so for now I’d prefer not to think too far ahead. Instead, I’ll leave you with a bit of a teaser for our upcoming 4.0.0 release…

CryptPad 4.0.0, coming in January 2021!

Thanks so much to everyone who’s supported us in any way throughout this difficult year.

We wish you all the best in 2021!

Recent interviews and presentations

CryptPad and the team have received some attention recently through various channels: a Reddit AMA thread, podcast, presentation, and blog interview. Whether you prefer to read, watch or listen there is a way for you to get an up-to-date presentation of CryptPad.

Aaron MacSween was invited for an Ask Me Anything (AMA) thread on r/privacy on Reddit. The thread started on November 27th 2020 and lasted for a couple of days. 98 questions were raised about everything from CryptPad’s business model, to future development plans, the broader state of open-source software, and the EU’s recent debates against encryption.

Aaron was also a guest on the We Don’t Stream podcast in an episode titled Imagine Google Docs but without the spyware. This was a general presentation of CryptPad, the origin story and the motivations behind it. Each guest is asked to nominate an NGO to encourage donations. Aaron chose the Electronic Frontier Foundation so please consider supporting them.

Ludovic Dubost was interviewed on Website Planet, answering questions about XWiki and CryptPad. The interview covers the founding of XWiki, the effects of the COVID crisis on the company and CryptPad, as well as the future of the open-source industry.

Ludovic also presented CryptPad at esLibre 2020, a Spanish open-source conference organised this year by King Juan Carlos University. Like many others this year, this was a “virtual” event; the video is available on our PeerTube channel. Ludovic covered the background of the project and gave a demo-tour of CryptPad.

The outage of December 8th, 2020 - a postmortem

On December 8th, 2020 a malfunction in the water-cooling system at a data-center in Roubaix, France caused an unrecoverable error in the physical machine which hosts CryptPad.fr. The service was unavailable for approximately 27 hours while we diagnosed a corrupted OS, provisioned a new server, and migrated user data to the new system.

What happened

Our team works remotely across three different timezones, so for the sake of simplicity I’ll summarize the timeline of the service outage in Central European Time (CET) using a 24-hour clock.

December 8, 05:30 - Server update deployed

I occasionally start my working days very early in the morning when we have as few users connected as possible. On the days where a server restart is necessary, I do it at this time to minimize the number of active users that might be inconvenienced by the momentary service interruption.

During the process of deploying a minor patch to optimize how the service loads and evicts document metadata I noticed that our monthly full-disk backup was running. We run a less intensive incremental backup on a daily basis, but having a regular full-disk backup ensures that restoration does not become increasingly difficult over time.

We’ve been breaking our records for the highest ever number of concurrent users on CryptPad.fr on a regular basis, so I tend to pay close attention to how our server is performing and how small changes in our code affect its performance.

A record number of concurrently connected clients, as reported by our admin panel

My colleagues hadn’t started their shifts yet, but I left a message informing them that the server would probably be under more load than usual. We were considering also deploying an update to our client code, but we usually avoid doing so on backup days to help ensure that things go smoothly.

Me jinxing our server for the day

December 8, 12:42 - CryptPad goes down

CryptPad is developed by a company called XWiki. We host everything using virtual machines provisioned on the same dedicated physical servers as the rest of the company’s infrastructure. Performance metrics and monitoring for our other sites suggest that the host machine did not fail instantly; rather, its performance degraded over a relatively short period of time. Some services failed sooner as the host system tried to de-prioritize less critical systems.

The last line in our server’s log was written to the disk at 12:42. I was away from my desk to eat a meal, so I didn’t notice that anything had gone wrong. I returned to find some user reports sent at around 13:00 that the service had been unavailable for some time. We occasionally receive such reports that turn out to be user error (typos in URLs or DNS problems), but in this case it was easy to confirm as a systemic problem since my CryptPad tabs were also disconnected.

My first assumption was that an error in the code I’d deployed earlier in the day had caused the server to crash. I tried to log into our servers from my terminal with no success, then tried to ping the server’s domain name, then its raw IP address, at which point I realized that the machine was completely powered down or otherwise unreachable.

The last time we were surprised with this kind of outage was in November 2017 when a power outage and a generator failure took several data-centers completely offline. That outage lasted 3.5 hours, which seemed very bad at the time, but I was expecting something similar.

December 8, 13:35 - Infra is on it

Since the same physical infrastructure hosts a large number of sites the outage had been noticed by many of our company’s employees almost instantly. We have lots of monitoring in place to send warnings when things are performing poorly (or not at all), but I learned via our company’s internal chat service that at least one of our physical servers had had a critical failure and that Kevin (our resident infra expert) was working on it. CryptPad’s track record for uptime until this point was very good, and most of that was due to Kevin, so I tried to leave him alone so he could focus on diagnosing and possibly fixing the problem. Since there didn’t seem to be anything I could do on that front to help the situation I started to respond to the related GitHub issues and messages in our public chat channel to inform our users what was going on.

At this point I also noticed several messages from my colleagues congratulating me on my five-year anniversary at the company. I’d forgotten the date, and grateful as I was for the wishes, this wasn’t how I’d expected to celebrate the milestone.

December 8, 15:00 - Host machine comes back online

By 15:00 the physical server that had gone offline had been powered back up. This meant that VMware (the software we use to host many virtual machines on one very powerful machine) was running again, though some more work was needed to bring many of its hosted VMs back online. Kevin immediately began running a range of integrity checks to confirm that the hardware was functioning correctly before relaunching services. Some VMs that required fewer resources were able to be re-launched very quickly, but CryptPad requires more storage than most of the wikis we host, and disk checks tend to require more time than other diagnostics.

At 15:40 these disk integrity checks were interrupted when one of the data-center technicians (who I’m sure was also having a bad day) had to take the server back offline to transfer our hardware to a new location in the same building. Access was restored just a few minutes later, but we had to restart our integrity checks.

December 8, 16:20 - First disk integrity check completes

Forty minutes after the manual intervention, the first of three disk checkups had completed. VMware was reporting that all systems were operational, but the VM that usually hosts our API server was failing to boot. Kevin was able to launch a Debian rescue system from a live disk and mount the system for inspection, but there was still no obvious indication why the system wouldn’t boot. He proceeded to launch checks for the remaining two disks while he continued to search for the cause of the failure.

December 8, 18:30 - Initial failure traced to a cooling malfunction

Throughout the duration of this downtime Kevin had been on and off the phone with the data-center technicians getting updates about what had happened and whether we should expect any further problems. By 18:30 we were informed of the cooling system’s malfunction. While it was somewhat comforting to know that the problem had nothing to do with code we’d written, it was also frustrating to be reminded that there will likely always be physical events like this that we can neither control nor predict.

As Twitter user @RimaSghaier noted, the internet is still very physical.

December 8, 19:30 - File transfer commences

By 19:30, between myself, Kevin, Ludovic (the company’s CEO), and one of Ludovic’s friends who has some more experience with the intricacies of bootable filesystems, we’d made no progress diagnosing why the affected VM would not boot outside of the environment of the rescue disk. We had access to all the system’s files and all of the integrity checks had passed, but there seemed to be a problem with the root filesystem. We decided that the safest thing to do was to provision a new VM and begin transferring the relevant files. We could interrupt the process if we discovered the reason for the failure, but it was already late in the day and we had no promising leads.

It took only a few minutes to provision a nearly identical VM and we immediately began transferring files via the data-center’s internal network. Unfortunately, there was around 750GB of data to transfer at a variable rate that did not seem very promising.

Until this point I’d been very hopeful that at any minute we would find some trick to get the original server back online. As it became increasingly apparent that this was unlikely and that we’d need to wait for the file transfer to finish we shifted our focus to damage control.

The API server that hosts our database and Node.js server had been offline, but we actually serve our static assets (HTML, JavaScript, etc.) from a different machine that had stayed online. I’d been distracted by the actual system outage and hadn’t thought to update our front-end to inform all our users of what was going on, though I had been posting to our Mastodon and Twitter accounts.

I hacked together and deployed some very basic HTML as quickly as I could, explaining what was happening and directing users to our social media for updates. This was deployed by 19:43.

CryptPad's down page

December 8, 20:00 - I try to get some sleep

Finally, after about 7 hours of downtime and a 14.5 hour shift on my part, we left the servers alone to continue their work and decided to get some rest for the following day. We expected the file transfer to take at least 10 more hours to complete, so I set my alarm for the following morning and called it a day.

December 9, 16:14

December 9th was not especially eventful. I spent most of the day idly monitoring the progress of the network file transfer. I was far too distracted to be productive with anything else, and anyway it seemed prudent to save my energy for when the transfer completed.

By about 13:30 the transfer was 90% complete and I began to pre-configure as much as possible on the new system so that we could bring everything back up as quickly as possible. I prepared and reviewed a list of final tasks with Kevin and Yann in the final 30 minutes of the transfer, and we started working as soon as it finished.
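As a rough illustration of what these numbers imply, the sketch below estimates the average transfer rate and the expected finish time from the 90% checkpoint. It assumes the transfer began at exactly 19:30 on December 8, that the total was exactly 750GB, and that the rate was constant. The post only gives approximate figures and describes the rate as variable, so treat the result as a ballpark.

```python
from datetime import datetime, timedelta

TOTAL_GB = 750                                # "around 750GB" in the post
STARTED = datetime(2020, 12, 8, 19, 30)       # assumed start of the transfer (CET)
CHECKPOINT = datetime(2020, 12, 9, 13, 30)    # "about 13:30", 90% complete
FRACTION_DONE = 0.9


def estimate_transfer(total_gb, started, checkpoint, fraction_done):
    """Return (average rate in GB per hour, estimated finish time).

    Assumes the rate observed so far stays constant for the remainder.
    """
    elapsed_hours = (checkpoint - started).total_seconds() / 3600
    rate = total_gb * fraction_done / elapsed_hours
    remaining_hours = total_gb * (1 - fraction_done) / rate
    return rate, checkpoint + timedelta(hours=remaining_hours)


rate_gb_per_hour, eta = estimate_transfer(TOTAL_GB, STARTED, CHECKPOINT, FRACTION_DONE)
rate_mb_per_second = rate_gb_per_hour * 1000 / 3600  # decimal units: 1 GB = 1000 MB

print(f"average rate: {rate_gb_per_hour:.1f} GB/h (~{rate_mb_per_second:.1f} MB/s)")
print(f"estimated finish: {eta:%Y-%m-%d %H:%M}")
```

Under these assumptions the average works out to roughly 10 MB/s, with the transfer finishing mid-afternoon on December 9.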

We were able to complete the system’s setup in around 20 minutes, including a last-minute configuration fix to restrict the service to our IP addresses before we launched it. This restriction allowed us to access CryptPad as normal before anyone else. We took about ten minutes to test the platform, loading any documents we’d been editing leading up to the crash and confirming that everything was behaving as expected.

Finally, by 16:14, after a bit more than 27 hours of downtime, we removed the IP address restriction and removed the downtime notice I’d deployed the evening before.
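The total downtime can be checked directly from the two timestamps in this timeline. The sketch takes the last server log line (12:42 on December 8) as the start of the outage. Both times are in CET, so no timezone conversion is needed.

```python
from datetime import datetime

WENT_DOWN = datetime(2020, 12, 8, 12, 42)  # last line written to the server log
BACK_UP = datetime(2020, 12, 9, 16, 14)    # IP restriction and notice removed

downtime = BACK_UP - WENT_DOWN
hours, remainder = divmod(int(downtime.total_seconds()), 3600)
minutes = remainder // 60

print(f"Downtime: {hours} h {minutes} min")  # Downtime: 27 h 32 min
```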

Difficulties and lessons learned

I’ll start by saying in very simple terms that this experience sucked. I know it was very frustrating for our users who couldn’t access their documents while the server was offline. I certainly had a terrible two shifts. It was stressful for everyone on our team, and I suspect it was similarly unpleasant for the data-center technicians as well.

It should be more obvious given the root meaning of the word internet, but we all depend on many systems functioning to maintain our daily routines. The majority of our users only contact us to report bugs. Kevin and I mostly end up chatting when one of us notices irregular server behaviour. We only contact OVH when our servers have problems, and they probably don’t deal too much with their municipal electricity and fuel providers except when their power goes out and they fall back to using generators. We are most aware of the systems that sustain us when they break.

On a positive note, though, I was pleasantly surprised by how understanding people were about the situation. One of our paying users cancelled their subscription, but it seems the outage served to remind many people that there are humans working on this project, and so we’ve actually seen an increase in the rate of donations and subscriptions in the week since. We greatly appreciate everyone’s generosity!

Some users seem to have understandably lost some confidence in our platform, as we’ve seen slightly fewer users at the usual peak hours (2700 concurrent connections instead of 3000). On the other hand, it seems like the downtime page led to a significant increase in our follower count on social media.

Many of our users rely on CryptPad as a persistent home for their documents, and in these cases downtime is very inconvenient. During the outage, however, I learned about this software which randomly redirects users to publicly hosted instances of open-source software platforms. If you use CryptPad as a place to collaborate rather than a place to store documents, then you could try cryptpad.random-redirect.de to find alternatives. If you host a CryptPad instance you could even inquire about adding your server to the list. One of the great things about open-source software is that failures that affect one server or service do not need to have global effects.

Despite the positive aspects of our community’s response to this event, I regret that it took so long to migrate to a new machine. The simple fact is that while we (mostly Kevin and Ludovic) have put a lot of effort into making sure that our hosting infrastructure is reliable, we were unprepared for the task of rapidly migrating our entire database to a new machine. We’re hosting about six times more data now than we were at the start of the year. Until now we’ve had little cause to consider the increasing difficulty of managing this growing dataset, and with everything else that has happened this year there has been little opportunity to do so. This event made it abundantly clear that we’re going to have to find the required time.

What we plan to do

It would be an understatement to say that I have a bit of an idealist stance when it comes to software. This is why I work on open-source, privacy-preserving tech. It’s terrible that modern, web-based software is as fragile as it is. That said, the alternative of emailing static documents to colleagues (or between devices) also makes it difficult to be productive.

It’s a bit ridiculous that a broken cooling system in northern France can mean that our 20,000 daily active users lose the ability to edit or even read their documents for more than a day. More frustrating is the fact that we were very nearly in a good position to mitigate many of the adverse effects of this outage. We’ve been working on some new offline-first functionality in CryptPad over the last few months and, as noted above, we were considering deploying the first phase of these improvements the day of the outage.

Our first offline features were deployed yesterday as a part of our 3.25.0 release. Now, every time you load a document in CryptPad you’re also populating an advanced cache in your browser. For now this only has the effect of reducing the total time to load cached documents, since we still wait for confirmation from the server that this is the most recent version of the content before removing the loading screen.

Our next step will be to merge a branch of our code which will instead load and display the last known state of any document from your local cache in offline mode, regardless of whether you’re able to reach our database server. This would have alleviated some of the inconvenience of our outage: users were still able to load the platform’s HTML and JavaScript, which would at least have let them access cached documents.
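To make the difference between the two behaviours concrete, here is a minimal TypeScript sketch. It is not CryptPad’s actual code: the names (CachedDoc, CheckServer, loadDocument), the version counter, and the offlineFallback flag are all hypothetical, chosen only to illustrate the idea. With the flag off, a cached document is only shown once the server has confirmed it is current. With the flag on, the last known state is shown even when the server cannot be reached.

```typescript
// Hypothetical sketch: these types and names are illustrative assumptions,
// not part of CryptPad's real API.
interface CachedDoc {
  version: number; // assumed to increase with every change to the document
  content: string;
}

interface LoadResult {
  content: string;
  source: "server" | "cache";
  confirmed: boolean; // true only if the server confirmed this is the latest state
}

// Stand-in for a request to the database server. Given the version we already
// hold (or null), it resolves to null when that version is current, resolves to
// the newer document otherwise, and rejects when the server is unreachable.
type CheckServer = (id: string, knownVersion: number | null) => Promise<CachedDoc | null>;

async function loadDocument(
  id: string,
  cache: Map<string, CachedDoc>,
  checkServer: CheckServer,
  offlineFallback: boolean
): Promise<LoadResult> {
  const cached = cache.get(id);
  try {
    const newer = await checkServer(id, cached ? cached.version : null);
    if (newer !== null) {
      // The server has something more recent: refresh the cache and show it.
      cache.set(id, newer);
      return { content: newer.content, source: "server", confirmed: true };
    }
    if (cached) {
      // Server confirmed the cached copy is current, so nothing is re-downloaded.
      return { content: cached.content, source: "cache", confirmed: true };
    }
    throw new Error("server reported no changes but nothing is cached for " + id);
  } catch (err) {
    if (offlineFallback && cached) {
      // Planned behaviour: show the last known state without confirmation.
      return { content: cached.content, source: "cache", confirmed: false };
    }
    // Current behaviour: without server confirmation the document is not shown.
    throw err;
  }
}
```

In this sketch, a caller could use the confirmed field to warn the reader that the displayed content may be out of date.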

The next major feature will be the use of service workers to enable browsers to use very advanced caching policies and load our client-side code even while entirely offline, allowing full access to cached documents under almost any circumstance. We expect to deploy these updates in early January 2021 as a part of our upcoming 4.0.0 release.

One of my favourite academic papers defines the term gray failures, in which well-intentioned attempts to introduce redundancy into online systems can paradoxically increase the likelihood of service degradation or interruptions. In the last few weeks both Amazon and Google (some of the richest companies on the planet, in case you haven’t heard of them) have experienced severe service outages. There are very few easy answers in this area, but we’re going to learn from this situation and work on solutions that would at least have let us recover more quickly.

If the next data-center failure happens in another three years I hope it will only last a small fraction of the time, and that our software will be so resilient you’ll hardly notice. In the meantime our team greatly appreciates all your support!