component.py 27.9 KB
# Copyright (C) 2011  Internet Systems Consortium, Inc. ("ISC")
#
# Permission to use, copy, modify, and distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND INTERNET SYSTEMS CONSORTIUM
# DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL
# INTERNET SYSTEMS CONSORTIUM BE LIABLE FOR ANY SPECIAL, DIRECT,
# INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
# FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
# NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
# WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

"""
Module for managing components (a component is an abstraction of a process).
It allows starting them in a given order, handling their crashes (what
happens depends on the kind of component) and shutting them down. It also
handles the configuration of all this.

Dependencies between them are not yet handled. If they turn out to be
needed, they will be added in the future.

This framework allows for a single process to be started multiple times (by
specifying multiple components with the same configuration). We might want
to add more convenient support (like providing a count argument to the
configuration). This is yet to be designed.
"""

import isc.log
from isc.log_messages.bind10_messages import *
import time
import os
import signal

logger = isc.log.Logger("boss")
DBG_TRACE_DATA = 20
DBG_TRACE_DETAILED = 80

START_CMD = 'start'
STOP_CMD = 'stop'

STARTED_OK_TIME = 10
COMPONENT_RESTART_DELAY = 10

STATE_DEAD = 'dead'
STATE_STOPPED = 'stopped'
STATE_RUNNING = 'running'

def get_signame(signal_number):
    """Return the symbolic name for a signal."""
    for sig in dir(signal):
        if sig.startswith("SIG") and sig[3].isalnum():
            if getattr(signal, sig) == signal_number:
                return sig
    return "unknown signal"
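
# A minimal sketch, not part of the original module: on Python 3.5 and later
# the standard library's signal.Signals enum can perform the same lookup as
# get_signame(). The helper name signame_via_enum is hypothetical and is only
# meant to illustrate the idea.

```python
import signal


def signame_via_enum(signal_number):
    """Hypothetical alternative to get_signame() based on signal.Signals."""
    try:
        # signal.Signals maps a signal number to its enum member.
        return signal.Signals(signal_number).name
    except ValueError:
        # The number is not a signal known on this platform.
        return "unknown signal"


print(signame_via_enum(signal.SIGTERM))
print(signame_via_enum(0))
```

# Where several names share one number (aliases), the enum reports a single
# canonical name, which may differ from the first match found by dir().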

class BaseComponent:
    """
    This represents a single component. This one is an abstract base class.
    There are some methods which should be left untouched, but there are
    others which define the interface only and should be overridden in
    concrete implementations.

    The component is in one of the three states:
    - Stopped - it is either not started yet or it was explicitly stopped.
      The component is created in this state (it must be asked to start
      explicitly).
    - Running - after start() was called, it started successfully and is
      now running.
    - Dead - it failed and can not be resurrected.

    Init
      |            stop()
      |  +-----------------------+
      |  |                       |
      v  |  start()  success     |
    Stopped --------+--------> Running <----------+
                    |            |                |
                    |failure     | failed()       |
                    |            |                |
                    v            |                |
                    +<-----------+                |
                    |                             |
                    |  kind == dispensable or kind|== needed and failed late
                    +-----------------------------+
                    |
                    | kind == core or kind == needed and it failed too soon
                    v
                  Dead

    Note that there are still situations which are not handled properly here.
    We don't recognize a component that is starting up, but not ready yet, one
    that is already shutting down, impossible to stop, etc. We need to add more
    states in future to handle it properly.
    """
    def __init__(self, boss, kind):
        """
        Creates the component in not running mode.

        The parameters are:
        - `boss` the boss object to plug into. The component needs to plug
          into it to know when it failed, etc.
        - `kind` is the kind of component. It may be one of:
          * 'core' means the system can't run without it and it can't be
            safely restarted. If it does not start, the system is brought
            down. If it crashes, the system is turned off as well (with
            non-zero exit status).
          * 'needed' means the system is able to restart the component,
            but it is a vital part of the service (like auth server). If
            it fails to start or crashes in less than 10s after the first
            startup, the system is brought down. If it crashes later on,
            it is restarted (see below).
          * 'dispensable' means the component should be running, but if it
            doesn't start or crashes for some reason, the system simply tries
            to restart it and keeps running.

        For components that are restarted, the restarts are not always
        immediate; if the component has run for more than
        COMPONENT_RESTART_DELAY (10) seconds, it is restarted right
        away. If the component has not run that long, the system waits
        until that time has passed (since the last start) before the
        component is restarted.

        Note that the __init__ method of child class should have these
        parameters:

        __init__(self, process, boss, kind, address=None, params=None)

        The extra parameters are:
        - `process` - which program should be started.
        - `address` - the address on message bus, used to talk to the
           component.
        - `params` - parameters to the program.

        The methods you should not override are:
        - start
        - stop
        - failed
        - running

        You should override:
        - _start_internal
        - _stop_internal
        - _failed_internal (if you like, the empty default might be suitable)
        - name
        - pid
        - kill
        """
        if kind not in ['core', 'needed', 'dispensable']:
            # str() keeps the error readable even if kind is not a string.
            raise ValueError('Component kind can not be ' + str(kind))
        self.__state = STATE_STOPPED
        self._kind = kind
        self._boss = boss
        self._original_start_time = None

    def start(self):
        """
        Start the component for the first time or restart it. It runs
        _start_internal to actually start the component.

        If you try to start an already running component, it raises ValueError.
        """
        if self.__state == STATE_DEAD:
            raise ValueError("Can't resurrect already dead component")
        if self.running():
            raise ValueError("Can't start already running component")
        logger.info(BIND10_COMPONENT_START, self.name())
        self.__state = STATE_RUNNING
        self.__start_time = time.time()
        if self._original_start_time is None:
            self._original_start_time = self.__start_time
        # Reset the attribute that get_restart_time() actually reads.
        self._restart_at = None
        try:
            self._start_internal()
        except Exception as e:
            logger.error(BIND10_COMPONENT_START_EXCEPTION, self.name(), e)
            self.failed(None)
            raise

    def stop(self):
        """
        Stop the component. It calls _stop_internal to do the actual
        stopping.

        If you try to stop a component that is not running, it raises
        ValueError.
        """
        # This is not tested. It talks with the outer world, which is out
        # of scope of unittests.
        if not self.running():
            raise ValueError("Can't stop a component which is not running")
        logger.info(BIND10_COMPONENT_STOP, self.name())
        self.__state = STATE_STOPPED
        self._stop_internal()

    def failed(self, exit_code):
        """
        Notify the component it crashed. This will be called from boss object.

        If you try to call failed on a component that is not running,
        a ValueError is raised.

        If it is a core component or needed component and it was started only
        recently, the component will become dead and will ask the boss to shut
        down with error exit status. A dead component can't be started again.

        Otherwise the component will try to restart.

        The exit code is used for logging. It might be None.

        It calls _failed_internal internally.

        Returns True if the process was immediately restarted, returns
                False if the process was not restarted, either because
                it is considered a core or needed component, or because
                the component is to be restarted later.
        """

        if exit_code is not None:
            if os.WIFEXITED(exit_code):
                exit_str = "process exited normally with exit status %d" % (exit_code)
            elif os.WIFCONTINUED(exit_code):
                exit_str = "process continued with exit status %d" % (exit_code)
            elif os.WIFSTOPPED(exit_code):
                sig = os.WSTOPSIG(exit_code)
                signame = get_signame(sig)
                exit_str = "process stopped with exit status %d (stopped by signal %d: %s)" % (exit_code, sig, signame)
            elif os.WIFSIGNALED(exit_code):
                if os.WCOREDUMP(exit_code):
                    exit_str = "process dumped core with exit status %d" % (exit_code)
                else:
                    sig = os.WTERMSIG(exit_code)
                    signame = get_signame(sig)
                    exit_str = "process terminated with exit status %d (killed by signal %d: %s)" % (exit_code, sig, signame)
            else:
                exit_str = "unknown condition with exit status %d" % (exit_code)
        else:
            exit_str = "unknown condition"

        logger.error(BIND10_COMPONENT_FAILED, self.name(), self.pid(),
                     exit_str)
        if not self.running():
            raise ValueError("Can't fail component that isn't running")
        self.__state = STATE_STOPPED
        self._failed_internal()
        # If it is a core component or the needed component failed to start
        # (including it stopped really soon)
        if self._kind == 'core' or \
            (self._kind == 'needed' and time.time() - STARTED_OK_TIME <
             self._original_start_time):
            self.__state = STATE_DEAD
            logger.fatal(BIND10_COMPONENT_UNSATISFIED, self.name())
            self._boss.component_shutdown(1)
            return False
        # This means we want to restart
        else:
            # if the component was only running for a short time, don't
            # restart right away, but set a time at which it wants to be
            # restarted, and return that it wants to be restarted later
            self.set_restart_time()
            return self.restart()

    def set_restart_time(self):
        """Calculates and sets the time this component should be restarted.
           Currently, it uses a very basic algorithm; start time +
           COMPONENT_RESTART_DELAY (10 seconds). This algorithm may be
           improved upon in the future.
        """
        self._restart_at = self.__start_time + COMPONENT_RESTART_DELAY

    def get_restart_time(self):
        """Returns the time at which this component should be restarted."""
        return self._restart_at

    def restart(self, now = None):
        """Restarts the component if it has a restart_time and if the value
           of the restart_time is smaller than 'now'.

           If the parameter 'now' is given, its value will be used instead
           of calling time.time().

           Returns True if the component is restarted, False if not."""
        if now is None:
            now = time.time()
        if self.get_restart_time() is not None and\
           self.get_restart_time() < now:
            self.start()
            return True
        else:
            return False

    def running(self):
        """
        Informs if the component is currently running. It assumes failed()
        is called whenever the component really fails and there might be some
        time in between actual failure and the call, so this might be
        inaccurate (it corresponds to the thing the object thinks is true, not
        to the real "external" state).

        It is not expected for this method to be overridden.
        """
        return self.__state == STATE_RUNNING

    def _start_internal(self):
        """
        This method does the actual starting of a process. You need to override
        this method to do the actual starting.

        The ability to override this method presents some flexibility. It
        allows processes started in a strange way, as well as components that
        have no processes at all or components with multiple processes (in case
        of multiple processes, care should be taken to make their
        started/stopped state in sync and all the processes that can fail
        should be registered).

        You should register all the processes created by calling
        self._boss.register_process.
        """
        pass

    def _stop_internal(self):
        """
        This is the method that does the actual stopping of a component.
        You need to provide it in a concrete implementation.

        Also, note that it is a bad idea to raise exceptions from here.
        Under such circumstance, the component will be considered stopped,
        and the exception propagated, but we can't be sure it really is
        dead.
        """
        pass

    def _failed_internal(self):
        """
        This method is called from failed. You can replace it if you need
        some specific behaviour when the component crashes. The default
        implementation is empty.

        Do not raise exceptions from here, please. The proper shutdown
        would not have happened.
        """
        pass

    def name(self):
        """
        Provides a human-readable name of the component, for logging and
        similar purposes.

        You need to provide this method in a concrete implementation.
        """
        pass

    def pid(self):
        """
        Provides a PID of a process, if the component is a real running
        process. This may return None in cases when there's no process
        involved with the component or in case the component is not started
        yet.

        However, it is expected the component preserves the pid after it was
        stopped, to ensure we can log it when we ask it to be killed (in case
        the process refused to stop willingly).

        You need to provide this method in a concrete implementation.
        """
        pass

    def kill(self, forceful=False):
        """
        Kills the component.

        If forceful is true, it should do it in a more direct and aggressive
        way (for example by using SIGKILL or some equivalent). If it is false,
        a more peaceful way should be used (SIGTERM or equivalent).

        You need to provide this method in a concrete implementation.
        """
        pass
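
# A minimal, self-contained sketch of the failure policy described in the
# BaseComponent docstrings, assuming the same constants as this module. The
# helper names outcome_after_failure and restart_due are hypothetical; the
# real decision is made inside BaseComponent.failed() and restart().

```python
STARTED_OK_TIME = 10            # seconds; same value as in this module
COMPONENT_RESTART_DELAY = 10    # seconds; same value as in this module


def outcome_after_failure(kind, original_start_time, now):
    """Return 'dead' or 'restart' for a component that just failed."""
    if kind not in ('core', 'needed', 'dispensable'):
        raise ValueError('Component kind can not be ' + str(kind))
    # "Too soon" is measured from the very first start of the component.
    failed_too_soon = now - STARTED_OK_TIME < original_start_time
    if kind == 'core' or (kind == 'needed' and failed_too_soon):
        return 'dead'
    return 'restart'


def restart_due(last_start_time, now):
    """Return True once the restart delay since the last start has passed."""
    restart_at = last_start_time + COMPONENT_RESTART_DELAY
    return restart_at < now


print(outcome_after_failure('needed', 100, 105))
print(outcome_after_failure('needed', 100, 200))
print(restart_due(100, 105), restart_due(100, 111))
```

# Note the two different reference points: the dead/restart decision uses the
# first start time, while the restart delay uses the most recent start time.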

class Component(BaseComponent):
    """
    The most common implementation of a component. It can be used either
    directly, and it will just start the process without anything special,
    or slightly customised by passing a start_func hook to the __init__
    to change the way it starts.

    If such customisation isn't enough, you should inherit BaseComponent
    directly. It is not recommended to override methods of this class
    on a one-by-one basis.
    """
    def __init__(self, process, boss, kind, address=None, params=None,
                 start_func=None):
        """
        Creates the component in not running mode.

        The parameters are:
        - `process` is the name of the process to start.
        - `boss` the boss object to plug into. The component needs to plug
          into it to know when it failed, etc.
        - `kind` is the kind of component. Refer to the documentation of
          BaseComponent for details.
        - `address` is the address on the message bus. It is used to ask it
          to shut down at the end. If you specialize the class for a
          component that is shut down differently, it might be None.
        - `params` is a list of parameters to pass to the process when it
          starts. It is currently unused and this support is left out for
          now.
        - `start_func` is a function called when it is started. It is supposed
          to start up the process and return a ProcInfo object describing it.
          There's a sensible default if not provided, which just launches
          the program without any special care.
        """
        BaseComponent.__init__(self, boss, kind)
        self._process = process
        self._start_func = start_func
        self._address = address
        self._params = params
        self._procinfo = None

    def _start_internal(self):
        """
        You can change the "core" of this function by setting self._start_func
        to a function without parameters. Such function should start the
        process and return the procinfo object describing the running process.

        If you don't provide the _start_func, the usual startup by calling
        boss.start_simple is performed.
        """
        # This one is not tested. For one, it starts a real process
        # which is out of scope of unit tests, for another, it just
        # delegates the starting to other function in boss (if a derived
        # class does not provide an override function), which is tested
        # by use.
        if self._start_func is not None:
            procinfo = self._start_func()
        else:
            # TODO Handle params, etc
            procinfo = self._boss.start_simple(self._process)
        self._procinfo = procinfo
        self._boss.register_process(self.pid(), self)

    def _stop_internal(self):
        self._boss.stop_process(self._process, self._address, self.pid())
        # TODO Some way to wait for the process that doesn't want to
        # terminate and kill it would prove nice (or add it to boss somewhere?)

    def name(self):
        """
        Returns the name, derived from the process name.
        """
        return self._process

    def pid(self):
        return self._procinfo.pid if self._procinfo is not None else None

    def kill(self, forceful=False):
        # The parameter name matches BaseComponent.kill(), so keyword calls
        # work the same way on both classes.
        if self._procinfo is not None:
            if forceful:
                self._procinfo.process.kill()
            else:
                self._procinfo.process.terminate()

class Configurator:
    """
    This thing keeps track of configuration changes and starts and stops
    components as it goes. It also handles the initial startup and final
    shutdown.

    Note that this will allow you to stop (by invoking reconfigure) a core
    component. There should be some kind of layer protecting users from ever
    doing so (users must not stop the config manager, message queue and stuff
    like that or the system won't start again). However, if a user specifies
    b10-auth as core, it is safe to stop that one.

    The parameters are:
    * `boss`: The boss we are managing for.
    * `specials`: Dict of specially started components. Each item is a class
      representing the component.

    The configuration passed to it (by startup() and reconfigure()) is a
    dictionary, each item represents one component that should be running.
    The key is a unique identifier used to reference the component. The
    value is a dictionary describing the component. All items in the
    description are optional unless told otherwise and they are as follows:
    * `special` - Some components are started in a special way. If it is
      present, it specifies which class from the specials parameter should
      be used to create the component. In that case, some of the following
      items might be irrelevant, depending on the special component chosen.
      If it is not there, the basic Component class is used.
    * `process` - Name of the executable to start. If it is not present,
      it defaults to the identifier of the component.
    * `kind` - The kind of component, one of 'core', 'needed' or
      'dispensable'. This specifies what happens if the component fails.
      This one is required.
    * `address` - The address of the component on message bus. It is used
      to shut down the component. All special components currently either
      know their own address or don't need one and ignore it. The common
      components should provide this.
    * `params` - The command line parameters of the executable. Defaults
      to no parameters. It is currently unused.
    * `priority` - When starting the component, the components with higher
      priority are started before the ones with lower priority. If it is
      not present, it defaults to 0.
    """
    def __init__(self, boss, specials=None):
        """
        Initializes the configurator, but nothing is started yet.

        The boss parameter is the boss object used to start and stop processes.
        """
        self.__boss = boss
        # These could be __private, but as we access them from within unittest,
        # it's more comfortable to have them just _protected.

        # They are tuples (configuration, component)
        self._components = {}
        self._running = False
        # Avoid sharing one mutable default dict between instances.
        self.__specials = specials if specials is not None else {}

    def __reconfigure_internal(self, old, new):
        """
        Does a switch from one configuration to another.
        """
        self._run_plan(self._build_plan(old, new))

    def startup(self, configuration):
        """
        Starts the first set of processes. This configuration is expected
        to be hardcoded from the boss itself to start the configuration
        manager and other similar things.
        """
        if self._running:
            raise ValueError("Trying to start the component configurator " +
                             "twice")
        logger.info(BIND10_CONFIGURATOR_START)
        self.__reconfigure_internal(self._components, configuration)
        self._running = True

    def shutdown(self):
        """
        Shuts everything down.

        It is not expected that anyone would want to shutdown and then start
        the configurator again, so we don't explicitly make sure that would
        work. However, we are not aware of anything that would make it not
        work either.
        """
        if not self._running:
            raise ValueError("Trying to shutdown the component " +
                             "configurator while it's not yet running")
        logger.info(BIND10_CONFIGURATOR_STOP)
        self._running = False
        self.__reconfigure_internal(self._components, {})

    def reconfigure(self, configuration):
        """
        Changes configuration from the current one to the provided. It
        starts and stops all the components as needed (e.g. if there's
        a component that was not in the original configuration, it is
        started, any component that was in the old and is not in the
        new one is stopped).
        """
        if not self._running:
            raise ValueError("Trying to reconfigure the component " +
                             "configurator while it's not yet running")
        logger.info(BIND10_CONFIGURATOR_RECONFIGURE)
        self.__reconfigure_internal(self._components, configuration)

    def _build_plan(self, old, new):
        """
        Builds a plan for how to move from the old configuration to the new
        one. It'll be sorted by priority and it will contain the components
        (already created, but not started). Each command in the plan is a dict,
        so it can be extended any time in the future to include whatever
        parameters each operation might need.

        Any configuration problems are expected to be handled here, so the
        plan is not yet run.
        """
        logger.debug(DBG_TRACE_DATA, BIND10_CONFIGURATOR_BUILD, old, new)
        plan = []
        # Handle removals of old components
        for cname in old.keys():
            if cname not in new:
                component = self._components[cname][1]
                if component.running():
                    plan.append({
                        'command': STOP_CMD,
                        'component': component,
                        'name': cname
                    })
        # Handle transitions of configuration of what is here
        for cname in new.keys():
            if cname in old:
                for option in ['special', 'process', 'kind', 'address',
                               'params']:
                    if new[cname].get(option) != old[cname][0].get(option):
                        raise NotImplementedError('Changing configuration of' +
                                                  ' a running component is ' +
                                                  'not yet supported. Remove' +
                                                  ' and re-add ' + cname +
                                                  ' to get the same effect')
        # Handle introduction of new components
        plan_add = []
        for cname in new.keys():
            if cname not in old:
                component_config = new[cname]
                creator = Component
                if 'special' in component_config:
                    # TODO: Better error handling
                    creator = self.__specials[component_config['special']]
                component = creator(component_config.get('process', cname),
                                    self.__boss, component_config['kind'],
                                    component_config.get('address'),
                                    component_config.get('params'))
                priority = component_config.get('priority', 0)
                # We store tuples, priority first, so we can easily sort
                plan_add.append((priority, {
                    'component': component,
                    'command': START_CMD,
                    'name': cname,
                    'config': component_config
                }))
        # Push the starts there sorted by priority
        plan.extend([command for (_, command) in sorted(plan_add,
                                                        reverse=True,
                                                        key=lambda command:
                                                            command[0])])
        return plan

    def running(self):
        """
        Returns whether the configurator is running (e.g. was started by
        startup and not yet stopped by shutdown).
        """
        return self._running

    def _run_plan(self, plan):
        """
        Run a plan, created beforehand by _build_plan.

        With the start and stop commands, it also adds and removes components
        in _components.

        Currently implemented commands are:
        * start
        * stop

        The plan is a list of tasks, each task is a dictionary. It must contain
        at least 'component' (a component object to work with) and 'command'
        (the command to do). Currently, both existing commands need 'name' of
        the component as well (the identifier from configuration). The 'start'
        one needs the 'config' to be there, which is the configuration
        description of the component.
        """
        done = 0
        try:
            logger.debug(DBG_TRACE_DATA, BIND10_CONFIGURATOR_RUN, len(plan))
            for task in plan:
                component = task['component']
                command = task['command']
                logger.debug(DBG_TRACE_DETAILED, BIND10_CONFIGURATOR_TASK,
                             command, component.name())
                if command == START_CMD:
                    component.start()
                    self._components[task['name']] = (task['config'],
                                                      component)
                elif command == STOP_CMD:
                    if component.running():
                        component.stop()
                    del self._components[task['name']]
                else:
                    # Can Not Happen (as the plans are generated by ourselves).
                    # Therefore not tested.
                    raise NotImplementedError("Command unknown: " + command)
                done += 1
        except:
            logger.error(BIND10_CONFIGURATOR_PLAN_INTERRUPTED, done, len(plan))
            raise
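

# A minimal sketch of the configuration format described in the Configurator
# docstring and of how _build_plan orders the start commands by priority. It
# does not use the classes above; the function start_order, the component
# names 'msgq', 'cfgmgr' and 'b10-stats', and the address strings are
# assumptions made for illustration only.

```python
configuration = {
    'msgq': {'kind': 'core', 'priority': 200},
    'cfgmgr': {'kind': 'core', 'priority': 100},
    # 'priority' is optional and defaults to 0.
    'b10-auth': {'kind': 'needed', 'address': 'Auth'},
    'b10-stats': {'kind': 'dispensable', 'address': 'Stats'},
}


def start_order(old, new):
    """Return names of components to start, highest priority first."""
    plan_add = []
    for cname in new:
        if cname not in old:
            priority = new[cname].get('priority', 0)
            # Store tuples, priority first, as _build_plan does.
            plan_add.append((priority, cname))
    ordered = sorted(plan_add, reverse=True, key=lambda item: item[0])
    return [cname for (_, cname) in ordered]


print(start_order({}, configuration))
```

# Because the sort key is the priority alone and Python's sort is stable,
# components with equal priority keep the order in which the configuration
# dictionary yields them.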